- Jaro-Winkler similarity: a string similarity metric originally developed for comparing names in the U.S. Census. It detects small spelling differences between words based on the number of matching characters and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words that share a common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. Note: all of our string similarity methods use Unicode characters rather than bytes in their calculations.
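The computation can be sketched as follows (a minimal pure-Python version for illustration; libpostal's actual C implementation differs in details, though Python strings are already sequences of Unicode code points, mirroring the character-not-byte behavior noted above):

```python
def jaro(s1, s2):
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters match only if equal and within this sliding window
    window = max(max(len1, len2) // 2 - 1, 0)
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters that are out of order, halved
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)
```

With the 0.9 inclusion threshold above, a pair like the classic "MARTHA"/"MARHTA" example (Jaro-Winkler ≈ 0.961) would qualify.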
- Damerau-Levenshtein distance: the traditional edit distance metric, where a transposition of two adjacent characters counts as a single edit. If a pair of strings does not meet the Jaro-Winkler threshold, it can still qualify if the edit distance is at most 1 (the failure could be as simple as a transposed first character, which Jaro-Winkler penalizes heavily) and both strings are at least 4 characters long (many short strings are within edit distance 1 of each other, so we don't want to generate too many false positives). Note: since distances and similarities are not on the same scale, we use Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.
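A sketch of this qualifying rule, using the restricted Damerau-Levenshtein (optimal string alignment) distance; the threshold constants mirror the text, and the function names are illustrative rather than libpostal's API:

```python
def osa_distance(s1, s2):
    """Optimal string alignment distance: Levenshtein edits plus
    adjacent-character transpositions counted as a single edit."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def qualifies_by_edit_distance(s1, s2, max_dist=1, min_len=4):
    """Qualifying rule: edit distance at most 1, both strings length >= 4."""
    return min(len(s1), len(s2)) >= min_len and osa_distance(s1, s2) <= max_dist
```

So "main" vs. "amin" (first two characters transposed) qualifies even though the transposition at the start hurts its Jaro-Winkler score, while "ab" vs. "ba" is rejected on length.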
- Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally designed for aligning DNA sequences but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-". In other words, we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. The original Smith-Waterman required O(mn) time and space to compute this alignment (where m is the length of the first string and n is the length of the second). Gotoh's improvement still needs O(mn) time but only O(m) space (where m is the length of the longer string); however, it does not store the sequence of operations, only a single cost in which each type of edit pays a particular penalty (the affine gap penalty is the idea that opening a gap should cost more than extending it). The problem with a single cost is that it's not always clear what to make of the combined score. The new method we use in libpostal stores and returns a breakdown of the counts of the specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to compute or return the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method we know that for "Service" and "Svc", the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, and the two share a common prefix of "S", so "Svc" can be considered a possible abbreviation for "Service".
When we find one of these possible abbreviations, and none of the other thresholds are met (which can easily happen with abbreviations), it qualifies both tokens for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation. For strict abbreviations (those that match the criteria for possible abbreviations and also share a common suffix, e.g. "Festival" and "Fstvl") longer than a certain length, an optional static score is used (if it's higher than the Jaro-Winkler score), so that cases which really look like abbreviations can be weighted as more similar than the Jaro-Winkler score alone might indicate. We also use the square of the larger of the two term scores in place of their product, since we're effectively considering these words an exact match, and we calculate an appropriate offset for the lower-scoring vector's L2 norm to compensate.
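The idea of propagating operation subtotals instead of a traceback can be sketched as a Gotoh-style dynamic program where each cell carries a tuple of counts alongside its best score. This is a simplified illustration, not libpostal's implementation: the penalty values are made up, transposition handling is omitted for brevity, and it keeps full O(mn) tables rather than Gotoh's O(m) rolling rows:

```python
NEG_INF = float("-inf")

def _plus(c, dm=0, dmm=0, dgo=0, dge=0):
    # add to a (matches, mismatches, gap_opens, gap_extends) tuple
    return (c[0] + dm, c[1] + dmm, c[2] + dgo, c[3] + dge)

def affine_align_counts(a, b, match=2, mismatch=-2, gap_open=-3, gap_extend=-1):
    """Global alignment with affine gaps, returning edit-operation
    subtotals (matches, mismatches, gap_opens, gap_extends) for the
    best-scoring alignment, with no traceback. A gap of length k costs
    gap_open + k * gap_extend, matching the walkthrough above."""
    m, n = len(a), len(b)
    Z = (0, 0, 0, 0)
    # each cell is (score, counts); M = diagonal step, X = gap in b, Y = gap in a
    M = [[(NEG_INF, Z)] * (n + 1) for _ in range(m + 1)]
    X = [[(NEG_INF, Z)] * (n + 1) for _ in range(m + 1)]
    Y = [[(NEG_INF, Z)] * (n + 1) for _ in range(m + 1)]
    M[0][0] = (0, Z)
    for i in range(1, m + 1):
        X[i][0] = (gap_open + i * gap_extend, (0, 0, 1, i))
    for j in range(1, n + 1):
        Y[0][j] = (gap_open + j * gap_extend, (0, 0, 1, j))
    score = lambda cell: cell[0]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # diagonal step: match or mismatch
            ps, pc = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1], key=score)
            if a[i-1] == b[j-1]:
                M[i][j] = (ps + match, _plus(pc, dm=1))
            else:
                M[i][j] = (ps + mismatch, _plus(pc, dmm=1))
            # consume a[i-1] against a gap: open a new gap or extend one
            os_, oc = max(M[i-1][j], Y[i-1][j], key=score)
            xs, xc = X[i-1][j]
            X[i][j] = max((os_ + gap_open + gap_extend, _plus(oc, dgo=1, dge=1)),
                          (xs + gap_extend, _plus(xc, dge=1)), key=score)
            # consume b[j-1] against a gap
            os2, oc2 = max(M[i][j-1], X[i][j-1], key=score)
            ys, yc = Y[i][j-1]
            Y[i][j] = max((os2 + gap_open + gap_extend, _plus(oc2, dgo=1, dge=1)),
                          (ys + gap_extend, _plus(yc, dge=1)), key=score)
    return max(M[m][n], X[m][n], Y[m][n], key=score)[1]
```

For "Service" vs. "Svc" this yields 3 matches, 0 mismatches, 3 gap opens, and 4 gap extensions, exactly the operation counts in the "S--v-c-" walkthrough, and the match count equals the length of the shorter string.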
- Acronym alignments: especially prevalent in universities, non-profits, museums, government agencies, etc. We provide a language-based, stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, all of the tokens involved are included in the similarity score with a 1.0 local similarity (so those tokens' scores will be counted as evidence for a match, not against it). For cases like "AB" vs. "A B", where the scores may differ significantly between the combined version and the separate single-letter version, we take the greater of a) the acronym score in the shorter string or b) the L2 norm of the individual words in the longer string, and again use an offset for the norm of the lower-scoring vector.
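The stopword-aware part of the matching can be sketched as a small backtracking check (an illustrative helper with a toy stopword set; libpostal uses its per-language dictionaries here):

```python
# illustrative stopword subset; libpostal draws these from per-language dictionaries
STOPWORDS = {"of", "the", "de", "la", "and"}

def is_acronym(token, words, stopwords=STOPWORDS):
    """True if `token` could be an acronym of `words`, where stopwords
    may either contribute their first letter or be skipped entirely."""
    token = token.lower()

    def rec(ti, wi):
        if ti == len(token):
            # any leftover words must all be skippable stopwords
            return all(w.lower() in stopwords for w in words[wi:])
        if wi == len(words):
            return False
        w = words[wi].lower()
        # take this word's first letter...
        if w[0] == token[ti] and rec(ti + 1, wi + 1):
            return True
        # ...or skip it entirely if it's a stopword
        return w in stopwords and rec(ti, wi + 1)

    return rec(0, 0)
```

This accepts both "moma" (where the "o" comes from "of") and "mma" (where "of" is skipped) for "Museum of Modern Art", and "uc" for "University of California".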
- Known-phrase alignments: similar to acronym alignment. libpostal's dictionaries are used at deduping time as well, and if two strings contain known phrases with the same canonical form, those phrase spans are given the maximum similarity (the product of both spans' L2 norms).
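A toy illustration of the canonical-form check, with a hypothetical four-entry dictionary standing in for libpostal's dictionaries:

```python
# hypothetical mini-dictionary mapping surface phrases (token tuples)
# to a canonical form; libpostal's real dictionaries are far larger
CANONICAL = {
    ("st",): "saint",
    ("saint",): "saint",
    ("rd",): "road",
    ("road",): "road",
}

def same_canonical(span_a, span_b, dictionary=CANONICAL):
    """True if both phrase spans are known and share a canonical form,
    in which case the spans would receive the maximum similarity."""
    ca = dictionary.get(tuple(t.lower() for t in span_a))
    cb = dictionary.get(tuple(t.lower() for t in span_b))
    return ca is not None and ca == cb
```

Note that two unknown-but-identical phrases do not qualify under this rule; they are handled by the ordinary token similarity instead.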
- Multi-word phrase alignments for spacing variations: again similar to acronym alignment, but for handling spacing variations like "de la Cruz" vs. "dela Cruz" or "Sea Grape Dr" vs. "Seagrape Dr" as exact matches. In a token-based approach, the first word in a phrase might match on Jaro-Winkler similarity alone, but the subsequent words in the string with spaces would count against the similarity value. Having multiple words align to their concatenated form pays no penalty for the whitespace.
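One way to sketch the concatenation check (an illustrative helper, not libpostal's code):

```python
def concatenated_span(tokens, target):
    """Return (start, end) indices of a token span whose concatenation
    equals `target` (case-insensitive), or None if no span matches."""
    t = target.lower()
    for i in range(len(tokens)):
        s = ""
        for j in range(i, len(tokens)):
            s += tokens[j].lower()
            if s == t:
                return (i, j + 1)
            if len(s) >= len(t):
                break  # already too long; try the next start position
    return None
```

When such a span is found, the multi-word side aligns to the single concatenated token with no penalty for the whitespace.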
- Match demotion for differing sets of initials: depending on the term weighting scheme used and the type of names in the corpus, the weights for single letters (found in human initials) may be low enough that they don't affect the similarity much. This is especially true for information gain, where a single letter like "A" or "B" may cooccur with many different names and end up conveying no more information than a common stopword like "the" or "of". When this is true, "A & B Jewelry" vs. "B & C Jewelry" might generate a false positive because the scores for "A" and "C" are so low compared to "Jewelry" that they are effectively ignored. In cases where there's a likely dupe, given the sets of single-letter tokens S1 and S2 in the two strings respectively, we demote the match to needs_review if both sides of the symmetric difference (S1 - S2 and S2 - S1) are non-empty. Note: the typical case of human names with/without a middle initial will still match under this heuristic, i.e. "Yvette Clarke" matches "Yvette D Clarke", because the initial only exists in one string and there is no conflict. However, something like "J Dilla" vs. "K Dilla" would be classified as needing human review.
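The heuristic can be sketched as follows (token lists assumed pre-tokenized; an illustrative helper, not libpostal's code):

```python
def differing_initials(tokens_a, tokens_b):
    """True if the two token lists have conflicting single-letter tokens,
    i.e. both sides of the symmetric difference (S1 - S2 and S2 - S1)
    are non-empty; a likely dupe would then be demoted to needs_review."""
    s1 = {t.lower() for t in tokens_a if len(t) == 1 and t.isalpha()}
    s2 = {t.lower() for t in tokens_b if len(t) == 1 and t.isalpha()}
    return bool(s1 - s2) and bool(s2 - s1)
```

Non-alphabetic single characters like "&" are excluded, so "A & B Jewelry" vs. "B & C Jewelry" conflicts on {"a"} vs. {"c"}, while "Yvette Clarke" vs. "Yvette D Clarke" does not conflict because only one side has an initial at all.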