Reputation: 1181
I need to compare two unstructured addresses and be able to identify if they are the same (or similar enough).
I know we can use some fuzzy logic for this kind of comparison, with some tolerance for misspellings, but...
I do not want to reinvent the wheel. This seems like a common problem in many different contexts, and I think there is an existing algorithm (maybe with some slight modifications) that would fit this scenario.
Thanks in advance
Upvotes: 4
Views: 3230
Reputation: 3249
I've helped build some open source tools to do this.
Basically, the approach is to try to split an address into its constituent parts and then intelligently compare those parts.
Both parts of the problem are hard.
The first part is often called address parsing. Here's what we use: https://github.com/datamade/usaddress
The second part has many, many names, but let's call it fuzzy matching. Here's the library we made for that: https://github.com/datamade/dedupe
We also provided some facilities for using them together: http://dedupe.readthedocs.io/en/latest/Variable-definition.html#address-type
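To make the split-then-compare idea concrete, here is a rough sketch using only the Python standard library. It is not how usaddress or dedupe work internally: the regex tokenizer, the tiny `ABBREVIATIONS` table, and the `split_address` and `similarity` helpers are all hypothetical stand-ins for illustration, and real addresses need far more care than this.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny abbreviation table (an assumption for this
# sketch); real data needs a much larger one.
ABBREVIATIONS = {
    "street": "st", "avenue": "ave", "road": "rd",
    "north": "n", "south": "s", "east": "e", "west": "w",
}

def split_address(raw):
    """Naively split an address into a house number and the remaining words."""
    # Lowercase and drop punctuation, keeping only alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    # Normalize common long forms to their abbreviations.
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    # Treat a leading all-digit token as the house number.
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    return {"number": number, "street": " ".join(rest)}

def similarity(a, b):
    """Compare two addresses part by part; returns a score from 0.0 to 1.0."""
    pa, pb = split_address(a), split_address(b)
    # House numbers must match exactly: 123 vs 125 is a different building,
    # even though the strings differ by one character.
    if pa["number"] != pb["number"]:
        return 0.0
    # The street part tolerates misspellings via a fuzzy ratio.
    return SequenceMatcher(None, pa["street"], pb["street"]).ratio()
```

With this sketch, `similarity("123 Main Street", "123 main st.")` returns 1.0 and `similarity("123 Main Street", "125 Main Street")` returns 0.0, while a misspelling such as "123 Mian St" still scores high. Comparing parts separately is the point: a plain string distance over the whole address would treat a wrong house number as a minor typo.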
Upvotes: 5