Reputation: 1982
I am building an address matching module in R, where I would like to match a list of input addresses, inAddress,
against a database of all addresses, dbAddress.
Let's say each address contains a street number, street name, postal code, and city
to be matched. There are certain matching rules I would like to apply, for example:
postal code should be an exact match
street number should be an exact match; if no exact match is found, then consider fuzzy matching
Do you have any advice on the strategy and how to build it effectively? Here are several of my thoughts so far:
if
and limit those with
postal match first
I am concerned this will be a big performance hit. Also, is there a way to speed up matching multiple addresses at the same time? Perhaps join on postal code first to avoid a full search each time? Parallelism?
Any advice would be welcome. Thank you
Upvotes: 2
Views: 6473
Reputation: 980
Levenshtein distance is a must for catching simple spelling mistakes. Finding the right tolerance is important, because a similarity threshold below 0.8 would return too many false positives.
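As a rough illustration of that thresholding idea, here is a minimal sketch in base R using `utils::adist`. The helper name `lev_similarity`, the sample street names, and the way the distance is normalised (dividing by the longer string's length) are my assumptions, not something from the question; treat 0.8 as a starting point to tune on your own data.

```r
# Normalised Levenshtein similarity in [0, 1]; 1 means identical strings.
# (Hypothetical helper: the normalisation by the longer string is one choice of several.)
lev_similarity <- function(a, b) {
  d <- utils::adist(a, b, ignore.case = TRUE)   # matrix of edit distances, rows = a, cols = b
  max_len <- outer(nchar(a), nchar(b), pmax)    # length of the longer string in each pair
  1 - d / pmax(max_len, 1)                      # avoid dividing by zero for empty strings
}

# Hypothetical sample data
in_street <- c("Main Stret", "Victoria Raod", "Oak Lane")
db_street <- c("Main Street", "Victoria Road", "King Street")

sim <- lev_similarity(in_street, db_street)

# Best candidate per input address, kept only if it clears the threshold
best      <- apply(sim, 1, which.max)
best_sim  <- sim[cbind(seq_along(in_street), best)]
threshold <- 0.8
matched   <- ifelse(best_sim >= threshold, db_street[best], NA_character_)

print(data.frame(in_street, matched, similarity = round(best_sim, 3)))
```

To keep this fast on a large database, you would probably run it only against the rows of dbAddress that share the input's postal code, rather than against the full table.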
I’d also recommend keeping a dictionary of common short words and their misspellings that you can correct, such as road/raod or street/stret.
You may want to check for abbreviations as well. Ave vs Avenue starts with the same characters, whereas Rd vs Road is missing characters in the middle, so the matching rules are different. Once again, a dictionary could help.
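The dictionary idea could look something like the following sketch in base R. The lookup table `street_dict` and the function `normalise_street` are hypothetical names, and the entries are only examples; a real table would need to be built for your own country's address conventions (for instance, "st" can also mean "saint").

```r
# Hypothetical lookup table: variant -> canonical form (all lower case)
street_dict <- c(
  raod  = "road",   rd = "road",
  stret = "street", st = "street",
  ave   = "avenue"
)

# Lower-case the address, split it into words, and replace known variants
normalise_street <- function(x, dict = street_dict) {
  x <- tolower(trimws(x))
  words <- strsplit(x, "[^a-z0-9]+")        # split on anything that is not a letter or digit
  vapply(words, function(w) {
    w <- w[nzchar(w)]                       # drop empty tokens
    hit <- w %in% names(dict)
    w[hit] <- dict[w[hit]]                  # swap variants for their canonical form
    paste(w, collapse = " ")
  }, character(1))
}

cleaned <- normalise_street(c("12 Queen Rd.", "5 Main Stret", "7 Park Ave"))
print(cleaned)
```

Running both inAddress and dbAddress through the same normalisation before any fuzzy comparison means the edit distance is spent on genuine typos rather than on abbreviation differences.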
This article contains 12 tests to find addresses using fuzzy matching that could be useful for improving your algorithm. Many of these examples Google can’t even match!
The examples include:
Incorrect Type (Street vs Road)
Bordering / Nearby Suburb
After looking at several commercial address autocomplete widgets, this one (https://www.addy.co.nz/address-finder-fuzzy-matching) is by far the smartest for New Zealand addresses. Perhaps you can get inspiration and come up with an even better algorithm!
Upvotes: 2