Reputation: 23
I've been looking into Python and using the recordlinkage toolkit for address matching. I've found that string matching algorithms such as Levenshtein return false matches for very common addresses. Ideally an address with one very distinctive word in common would score higher than addresses sharing only very common words, e.g. "12 Pellican street" and "12 Pellican road" is a better match than "20 Main Street" and "34 Main Street".
Is there a method for incorporating weighted string matching, so that rarer words carry more importance for matching?
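To illustrate the kind of weighting I mean, here's a rough sketch I put together (pure Python, nothing from recordlinkage; the weighted_overlap helper is just my own illustration). Tokens are weighted by a rough inverse document frequency over the address list, so a shared rare word like "Pellican" counts for more than a shared common word like "Street":
import math
from collections import Counter

# Hypothetical helper, not from any library: score two addresses so that
# tokens that are rare across the dataset contribute more to the match.
def weighted_overlap(addr_a, addr_b, token_counts, total_docs):
    tokens_a = set(addr_a.lower().split())
    tokens_b = set(addr_b.lower().split())
    # Rough inverse document frequency: rare tokens get large weights,
    # common tokens like "street" or "main" get small ones.
    def idf(token):
        return math.log(total_docs / (1 + token_counts.get(token, 0)))
    shared = sum(idf(t) for t in tokens_a & tokens_b)
    combined = sum(idf(t) for t in tokens_a | tokens_b)
    return shared / combined if combined else 0.0

addresses = ['12 pellican street', '12 pellican road',
             '20 main street', '34 main street']
token_counts = Counter(t for a in addresses for t in set(a.lower().split()))

# The Pellican pair scores higher than the Main Street pair, because
# "pellican" is rarer in this dataset than "main".
print(weighted_overlap('12 pellican street', '12 pellican road',
                       token_counts, len(addresses)))
print(weighted_overlap('20 main street', '34 main street',
                       token_counts, len(addresses)))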
Upvotes: 1
Views: 617
Reputation: 23
I've found that using the qgram distance instead of the Levenshtein distance takes into consideration the frequency of the string in the dataset.
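For reference, here's a minimal sketch of how the qgram method plugs into the toolkit (the two frames and the 'address' column name are just placeholders):
import pandas as pd
import recordlinkage

# Toy frames; 'address' is a placeholder column name.
df_a = pd.DataFrame({'address': ['12 pellican street', '20 main street']})
df_b = pd.DataFrame({'address': ['12 pellican road', '34 main street']})

# Pair every record in df_a with every record in df_b.
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(df_a, df_b)

# Score each candidate pair with the q-gram string comparison
# instead of the Levenshtein one.
compare = recordlinkage.Compare()
compare.string('address', 'address', method='qgram', label='address_sim')
features = compare.compute(pairs, df_a, df_b)
print(features)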
Upvotes: 0
Reputation: 16081
You can use fuzzywuzzy.
Installation:
pip install fuzzywuzzy
Usage:
In [1]: from fuzzywuzzy import fuzz
In [2]: fuzz.ratio('12 Pellican street', '12 Pellican road')
Out[2]: 76
In [3]: fuzz.ratio("20 Main Street","34 Main Street")
Out[3]: 86
So, for address matching, we can create a custom function like the one below. It compares the street number and the street name separately and averages the two scores.
def match_address(address_1, address_2):
    # Split each address into the street number and the rest of the address.
    st_no_1, rest_1 = address_1.split(maxsplit=1)
    st_no_2, rest_2 = address_2.split(maxsplit=1)
    # Average the similarity of the street numbers and of the street names.
    return (fuzz.ratio(st_no_1, st_no_2) + fuzz.ratio(rest_1, rest_2)) / 2
Execution:
In [4]: match_address('12 Pellican street', '12 Pellican road')
Out[4]: 85.5
In [5]: match_address("20 Main Street","34 Main Street")
Out[5]: 50.0
Upvotes: 1