- Jaro-Winkler similarity: a string similarity metric originally developed for comparing names in the U.S. Census. It detects small spelling differences between words based on the number of matching characters and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words that share a common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. Note: all of our string similarity methods use Unicode characters rather than bytes in their calculations.
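The computation can be sketched as follows (a minimal pure-Python version for illustration; libpostal's actual C implementation differs in details, though Python strings are already sequences of Unicode code points, mirroring the character-not-byte behavior noted above):

```python
def jaro(s1, s2):
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters match only if equal and within this sliding window
    window = max(max(len1, len2) // 2 - 1, 0)
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters that are out of order, halved
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)
```

With the 0.9 inclusion threshold above, a pair like the classic "MARTHA"/"MARHTA" example (Jaro-Winkler ≈ 0.961) would qualify.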
- Damerau-Levenshtein distance: the traditional edit distance metric, where a transposition of two adjacent characters counts as a single edit. If a pair of strings does not meet the Jaro-Winkler threshold, it can still qualify if the edit distance is at most 1 (the failure could be as simple as a transposed first character, which Jaro-Winkler penalizes heavily) and both strings are at least 4 characters long (many short strings are within edit distance 1 of each other, so we don't want to generate too many false positives). Note: since distances and similarities are not on the same scale, we use Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.
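A sketch of this qualifying rule, using the restricted Damerau-Levenshtein (optimal string alignment) distance; the threshold constants mirror the text, and the function names are illustrative rather than libpostal's API:

```python
def osa_distance(s1, s2):
    """Optimal string alignment distance: Levenshtein edits plus
    adjacent-character transpositions counted as a single edit."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def qualifies_by_edit_distance(s1, s2, max_dist=1, min_len=4):
    """Qualifying rule: edit distance at most 1, both strings length >= 4."""
    return min(len(s1), len(s2)) >= min_len and osa_distance(s1, s2) <= max_dist
```

So "main" vs. "amin" (first two characters transposed) qualifies even though the transposition at the start hurts its Jaro-Winkler score, while "ab" vs. "ba" is rejected on length.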
- Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally designed for aligning DNA sequences but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-". In other words, we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. The original Smith-Waterman required O(mn) time and space to compute this alignment (where m is the length of the first string and n is the length of the second). Gotoh's improvement still needs O(mn) time but only O(m) space (where m is the length of the longer string); however, it does not store the sequence of operations, only a single cost in which each type of edit pays a particular penalty (the affine gap penalty is the idea that opening a gap should cost more than extending it). The problem with a single cost is that it's not always clear what to make of the combined score. The new method we use in libpostal stores and returns a breakdown of the counts of the specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to compute or return the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method we know that for "Service" and "Svc", the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, and the two share a common prefix of "S", so "Svc" can be considered a possible abbreviation for "Service".
When we find one of these possible abbreviations, and none of the other thresholds are met (which can easily happen with abbreviations), it qualifies both tokens for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation. For strict abbreviations (those that match the criteria for possible abbreviations and also share a common suffix, e.g. "Festival" and "Fstvl") longer than a certain length, an optional static score is used (if it's higher than the Jaro-Winkler score), so that cases which really look like abbreviations can be weighted as more similar than the Jaro-Winkler score alone might indicate. We also use the square of the larger of the two term scores in place of their product, since we're effectively considering these words an exact match, and we calculate an appropriate offset for the lower-scoring vector's L2 norm to compensate.
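The idea of propagating operation subtotals instead of a traceback can be sketched as a Gotoh-style dynamic program where each cell carries a tuple of counts alongside its best score. This is a simplified illustration, not libpostal's implementation: the penalty values are made up, transposition handling is omitted for brevity, and it keeps full O(mn) tables rather than Gotoh's O(m) rolling rows:

```python
NEG_INF = float("-inf")

def _plus(c, dm=0, dmm=0, dgo=0, dge=0):
    # add to a (matches, mismatches, gap_opens, gap_extends) tuple
    return (c[0] + dm, c[1] + dmm, c[2] + dgo, c[3] + dge)

def affine_align_counts(a, b, match=2, mismatch=-2, gap_open=-3, gap_extend=-1):
    """Global alignment with affine gaps, returning edit-operation
    subtotals (matches, mismatches, gap_opens, gap_extends) for the
    best-scoring alignment, with no traceback. A gap of length k costs
    gap_open + k * gap_extend, matching the walkthrough above."""
    m, n = len(a), len(b)
    Z = (0, 0, 0, 0)
    # each cell is (score, counts); M = diagonal step, X = gap in b, Y = gap in a
    M = [[(NEG_INF, Z)] * (n + 1) for _ in range(m + 1)]
    X = [[(NEG_INF, Z)] * (n + 1) for _ in range(m + 1)]
    Y = [[(NEG_INF, Z)] * (n + 1) for _ in range(m + 1)]
    M[0][0] = (0, Z)
    for i in range(1, m + 1):
        X[i][0] = (gap_open + i * gap_extend, (0, 0, 1, i))
    for j in range(1, n + 1):
        Y[0][j] = (gap_open + j * gap_extend, (0, 0, 1, j))
    score = lambda cell: cell[0]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # diagonal step: match or mismatch
            ps, pc = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1], key=score)
            if a[i-1] == b[j-1]:
                M[i][j] = (ps + match, _plus(pc, dm=1))
            else:
                M[i][j] = (ps + mismatch, _plus(pc, dmm=1))
            # consume a[i-1] against a gap: open a new gap or extend one
            os_, oc = max(M[i-1][j], Y[i-1][j], key=score)
            xs, xc = X[i-1][j]
            X[i][j] = max((os_ + gap_open + gap_extend, _plus(oc, dgo=1, dge=1)),
                          (xs + gap_extend, _plus(xc, dge=1)), key=score)
            # consume b[j-1] against a gap
            os2, oc2 = max(M[i][j-1], X[i][j-1], key=score)
            ys, yc = Y[i][j-1]
            Y[i][j] = max((os2 + gap_open + gap_extend, _plus(oc2, dgo=1, dge=1)),
                          (ys + gap_extend, _plus(yc, dge=1)), key=score)
    return max(M[m][n], X[m][n], Y[m][n], key=score)[1]
```

For "Service" vs. "Svc" this yields 3 matches, 0 mismatches, 3 gap opens, and 4 gap extensions, exactly the operation counts in the "S--v-c-" walkthrough, and the match count equals the length of the shorter string.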
- Acronym alignments: especially prevalent in universities, non-profits, museums, government agencies, etc. We provide a language-based, stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, all of the tokens involved are included in the similarity score with a 1.0 local similarity (so those tokens' scores will be counted as evidence for a match, not against it). For cases like "AB" vs. "A B", where the scores may differ significantly between the combined version and the separate single-letter version, we take the greater of a) the acronym score in the shorter string or b) the L2 norm of the individual words in the longer string, and again use an offset for the norm of the lower-scoring vector.
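The stopword-aware part of the matching can be sketched as a small backtracking check (an illustrative helper with a toy stopword set; libpostal uses its per-language dictionaries here):

```python
# illustrative stopword subset; libpostal draws these from per-language dictionaries
STOPWORDS = {"of", "the", "de", "la", "and"}

def is_acronym(token, words, stopwords=STOPWORDS):
    """True if `token` could be an acronym of `words`, where stopwords
    may either contribute their first letter or be skipped entirely."""
    token = token.lower()

    def rec(ti, wi):
        if ti == len(token):
            # any leftover words must all be skippable stopwords
            return all(w.lower() in stopwords for w in words[wi:])
        if wi == len(words):
            return False
        w = words[wi].lower()
        # take this word's first letter...
        if w[0] == token[ti] and rec(ti + 1, wi + 1):
            return True
        # ...or skip it entirely if it's a stopword
        return w in stopwords and rec(ti, wi + 1)

    return rec(0, 0)
```

This accepts both "moma" (where the "o" comes from "of") and "mma" (where "of" is skipped) for "Museum of Modern Art", and "uc" for "University of California".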
- Known-phrase alignments: similar to acronym alignment. libpostal's dictionaries are used at deduping time as well, and if two strings contain known phrases with the same canonical form, those phrase spans are given the maximum similarity (the product of both spans' L2 norms).
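A toy illustration of the canonical-form check, with a hypothetical four-entry dictionary standing in for libpostal's dictionaries:

```python
# hypothetical mini-dictionary mapping surface phrases (token tuples)
# to a canonical form; libpostal's real dictionaries are far larger
CANONICAL = {
    ("st",): "saint",
    ("saint",): "saint",
    ("rd",): "road",
    ("road",): "road",
}

def same_canonical(span_a, span_b, dictionary=CANONICAL):
    """True if both phrase spans are known and share a canonical form,
    in which case the spans would receive the maximum similarity."""
    ca = dictionary.get(tuple(t.lower() for t in span_a))
    cb = dictionary.get(tuple(t.lower() for t in span_b))
    return ca is not None and ca == cb
```

Note that two unknown-but-identical phrases do not qualify under this rule; they are handled by the ordinary token similarity instead.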
- Multi-word phrase alignments for spacing variations: again similar to acronym alignment, but for handling spacing variations like "de la Cruz" vs. "dela Cruz" or "Sea Grape Dr" vs. "Seagrape Dr" as exact matches. In a token-based approach, the first word in a phrase might match on Jaro-Winkler similarity alone, but the subsequent words in the string with spaces would count against the similarity value. Having multiple words align to their concatenated form pays no penalty for the whitespace.
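One way to sketch the concatenation check (an illustrative helper, not libpostal's code):

```python
def concatenated_span(tokens, target):
    """Return (start, end) indices of a token span whose concatenation
    equals `target` (case-insensitive), or None if no span matches."""
    t = target.lower()
    for i in range(len(tokens)):
        s = ""
        for j in range(i, len(tokens)):
            s += tokens[j].lower()
            if s == t:
                return (i, j + 1)
            if len(s) >= len(t):
                break  # already too long; try the next start position
    return None
```

When such a span is found, the multi-word side aligns to the single concatenated token with no penalty for the whitespace.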
- Match demotion for differing sets of initials: depending on the term weighting scheme used and the type of names in the corpus, the weights for single letters (found in human initials) may be low enough that they don't affect the similarity much. This is especially true for information gain, where a single letter like "A" or "B" may cooccur with many different names and end up conveying no more information than a common stopword like "the" or "of". When this is true, "A & B Jewelry" vs. "B & C Jewelry" might generate a false positive because the scores for "A" and "C" are so low compared to "Jewelry" that they are effectively ignored. In cases where there's a likely dupe, given the sets of single-letter tokens S1 and S2 in the two strings respectively, we demote the match to needs_review if both sides of the symmetric difference (S1 - S2 and S2 - S1) are non-empty. Note: the typical case of human names with/without a middle initial will still match under this heuristic, i.e. "Yvette Clarke" matches "Yvette D Clarke", because the initial only exists in one string and there is no conflict. However, something like "J Dilla" vs. "K Dilla" would be classified as needing human review.
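The heuristic can be sketched as follows (token lists assumed pre-tokenized; an illustrative helper, not libpostal's code):

```python
def differing_initials(tokens_a, tokens_b):
    """True if the two token lists have conflicting single-letter tokens,
    i.e. both sides of the symmetric difference (S1 - S2 and S2 - S1)
    are non-empty; a likely dupe would then be demoted to needs_review."""
    s1 = {t.lower() for t in tokens_a if len(t) == 1 and t.isalpha()}
    s2 = {t.lower() for t in tokens_b if len(t) == 1 and t.isalpha()}
    return bool(s1 - s2) and bool(s2 - s1)
```

Non-alphabetic single characters like "&" are excluded, so "A & B Jewelry" vs. "B & C Jewelry" conflicts on {"a"} vs. {"c"}, while "Yvette Clarke" vs. "Yvette D Clarke" does not conflict because only one side has an initial at all.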