Max

Reputation: 23

String matching algorithm with more weight given to more unique words?

I've been looking into Python and using the recordlinkage toolkit for address matching. I've found that string matching algorithms such as Levenshtein return false matches for very common addresses. Ideally, addresses that share a very distinctive word would score more highly than addresses that share only very common words; e.g. "12 Pellican street" and "12 Pellican road" should be a better match than "20 Main Street" and "34 Main Street".

Is there a method for incorporating weighted string matching, so that the more unique words in an address carry more importance for matching?
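
For illustration, the kind of weighting I have in mind might look something like this rough sketch (purely hypothetical, using scikit-learn's TfidfVectorizer so that rare tokens such as "Pellican" carry more weight than common ones such as "Main"):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

addresses = [
    "12 Pellican street", "12 Pellican road",
    "20 Main Street", "34 Main Street",
]

# Fit IDF weights on the whole address list, so tokens that appear in
# fewer addresses get a higher weight.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(addresses)

# On this toy list the "Pellican" pair scores higher than the
# "Main Street" pair, because "Pellican" is rarer than "Main"/"Street".
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
print(cosine_similarity(tfidf[2], tfidf[3])[0, 0])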

Upvotes: 1

Views: 617

Answers (2)

Max

Reputation: 23

I've found that the q-gram distance, unlike the Levenshtein distance, takes the frequency of the string in the dataset into consideration.
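
As a rough illustration of what a q-gram comparison does (a minimal sketch, not recordlinkage's own implementation): each string is broken into overlapping character q-grams, and the two multisets of q-grams are compared.

from collections import Counter

def qgram_similarity(s1, s2, q=2):
    # Count the overlapping character q-grams of each string.
    g1 = Counter(s1[i:i + q] for i in range(len(s1) - q + 1))
    g2 = Counter(s2[i:i + q] for i in range(len(s2) - q + 1))
    # Shared q-grams, scaled by the larger q-gram count, giving 0..1.
    shared = sum((g1 & g2).values())
    total = max(sum(g1.values()), sum(g2.values()))
    return shared / total if total else 0.0

# Pairwise q-gram similarity scores for the example addresses.
print(qgram_similarity("12 Pellican street", "12 Pellican road"))
print(qgram_similarity("20 Main Street", "34 Main Street"))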

Upvotes: 0

Rahul K P

Reputation: 16081

You can use fuzzywuzzy.

Installation

pip install fuzzywuzzy

Usage:

In [1]: from fuzzywuzzy import fuzz

In [2]: fuzz.ratio('12 Pellican street', '12 Pellican road')
Out[2]: 76

In [3]: fuzz.ratio("20 Main Street","34 Main Street")
Out[3]: 86

So, for address matching, we can create a custom function like this. It matches the street number and the street name separately and averages the two scores.

def match_address(address_1, address_2):
    # Split each address into street number and the rest
    # (assumes the address starts with "<number> <street name>").
    st_no_1, rest_1 = address_1.split(maxsplit=1)
    st_no_2, rest_2 = address_2.split(maxsplit=1)
    # Average the two partial similarity scores (each 0-100).
    return (fuzz.ratio(st_no_1, st_no_2) + fuzz.ratio(rest_1, rest_2)) / 2

Execution:

In [4]: match_address('12 Pellican street', '12 Pellican road')
Out[4]: 85.5

In [5]: match_address("20 Main Street","34 Main Street")
Out[5]: 50.0
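
As a side note, fuzzywuzzy also provides token-based scorers such as fuzz.token_sort_ratio, which sort the words before comparing, so the score is insensitive to word order:

In [6]: fuzz.token_sort_ratio('12 Pellican street', 'street 12 Pellican')
Out[6]: 100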

Upvotes: 1
