Kenny

Reputation: 1982

Hierarchical Fuzzy matching strategy for address matching

I am building an address matching module in R. I would like to match each address in a list, inAddress, against a database of all addresses, dbAddress.

Let's say each address contains a street number, street name, postal code and city to be matched. There are certain matching rules I would like to consider, for example:

Do you have any advice on the strategy and how to build it effectively? Here are several of my thoughts so far:

I am concerned this will hurt performance badly. Also, is there a way to speed up matching many addresses at the same time? Perhaps join on postal code first to avoid a full search each time? Parallelism?
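A minimal sketch of the postal-code idea, using base R only. The data frames, column names (`in_id`, `db_id`, `postal`, `in_street`, `db_street`) and sample rows are hypothetical, and the exact join assumes the postal code itself is clean, so an address with a mistyped postal code would get no candidates at all.

```r
# Hypothetical example data; column names and values are assumptions.
inAddress <- data.frame(
  in_id = 1:2,
  postal = c("1010", "6011"),
  in_street = c("12 Qeen Street", "7 Marine Prade"),
  stringsAsFactors = FALSE
)
dbAddress <- data.frame(
  db_id = 1:4,
  postal = c("1010", "1010", "6011", "6011"),
  db_street = c("12 Queen Street", "3 High Street",
                "7 Marine Parade", "7 Oriental Parade"),
  stringsAsFactors = FALSE
)

# Blocking: an exact join on postal code keeps only same-postcode pairs,
# so the fuzzy comparison never touches the rest of the database.
candidates <- merge(inAddress, dbAddress, by = "postal")

# Fuzzy step: Levenshtein distance for each candidate pair (element-wise).
candidates$dist <- mapply(
  function(a, b) as.integer(utils::adist(a, b, ignore.case = TRUE)),
  candidates$in_street, candidates$db_street,
  USE.NAMES = FALSE
)

# Keep the closest database row for each input address.
best <- do.call(rbind, lapply(split(candidates, candidates$in_id), function(g) {
  g[which.min(g$dist), ]
}))
best[, c("in_id", "db_id", "in_street", "db_street", "dist")]
```

Because each postal-code block is independent of the others, the per-block work could in principle be spread over several cores (for example with the `parallel` package), although whether that pays off depends on the block sizes.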

Any advice would be welcome. Thank you.

Upvotes: 2

Views: 6473

Answers (1)

Strydom

Reputation: 980

Levenshtein distance is a must for catching simple spelling mistakes. Finding the right tolerance is important, because a similarity threshold below 0.8 would return too many false positives.
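As a rough illustration, the sketch below turns the raw edit distance from base R's `adist` into a similarity between 0 and 1 and applies the 0.8 cut-off. Dividing by the length of the longer string is one common normalisation and is an assumption here, as are the function name `lev_similarity` and the sample streets.

```r
# Normalised Levenshtein similarity in [0, 1]; 1 means identical strings.
lev_similarity <- function(a, b) {
  d <- utils::adist(a, b, ignore.case = TRUE)   # distance matrix: rows = a, cols = b
  longest <- outer(nchar(a), nchar(b), pmax)    # longer string length per pair
  sim <- 1 - d / longest
  sim[longest == 0] <- 1                        # two empty strings count as identical
  sim
}

# Hypothetical sample data.
in_street <- c("Main Stret", "Qeen Street")
db_street <- c("Main Street", "Queen Street", "Marine Parade")

sim <- lev_similarity(in_street, db_street)
best <- apply(sim, 1, which.max)                # closest database entry per input
best_sim <- sim[cbind(seq_along(best), best)]
threshold <- 0.8                                # tune this on real data

data.frame(
  input = in_street,
  match = ifelse(best_sim >= threshold, db_street[best], NA),
  similarity = round(best_sim, 2)
)
```

Anything whose best similarity falls under the threshold is reported as `NA` rather than forced onto the nearest candidate, which is what keeps the false positives down.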

I’d recommend using a dictionary of short words that you can correct too, such as road/raod or street/stret.

You may also want to check for abbreviations. Ave vs Avenue is a prefix match, because the abbreviation starts with the same characters as the full word, whereas Rd vs Road drops characters from the middle, so the two cases need different matching rules. Once again, a dictionary could help.
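A dictionary like this can be applied as a normalisation step before any fuzzy comparison. The sketch below is only an illustration: the lookup table `street_dict` and the function `normalise_address` are hypothetical, and a real table would need care with ambiguous tokens (for instance "st" can also mean "saint").

```r
# Hypothetical lookup table: names are variants (abbreviations or
# misspellings), values are the canonical street type.
street_dict <- c(
  ave = "avenue", rd = "road", raod = "road",
  st = "street", stret = "street"
)

normalise_address <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))        # drop punctuation such as "Rd."
  tokens <- strsplit(trimws(x), "[[:space:]]+")    # one word vector per address
  vapply(tokens, function(tok) {
    hit <- tok %in% names(street_dict)
    tok[hit] <- unname(street_dict[tok[hit]])      # replace known variants only
    paste(tok, collapse = " ")
  }, character(1))
}

normalise_address(c("12 Queen Rd.", "7 Main Stret", "3 Park Ave"))
# "12 queen road" "7 main street" "3 park avenue"
```

Running both the input and the database through the same normalisation means abbreviations and known misspellings never reach the Levenshtein step, so the tolerance there can stay strict.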

This article contains 12 tests to find addresses using fuzzy matching that could be useful for improving your algorithm. Many of these examples Google can’t even match!

The examples include:

  1. Spelling Mistakes
  2. Missing Space
  3. Incorrect Type (Street vs Road)
  4. Bordering / Nearby Suburb
  5. Abbreviations
  6. Synonyms: Floor vs Level
  7. Unit, Flat or Apartment vs Letter
  8. Number vs Letter
  9. Extra Words (e.g. Front Door, Department Name)
  10. Swapped Letters
  11. Sounds Like
  12. Tokenisation (Different Input Order)
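Two of these cases, Missing Space and Tokenisation, can be handled with simple comparison keys before any edit-distance scoring. The helper names `token_sort_key` and `squash_key` below are made up for this sketch, and this is not the method used by the article, just one straightforward way to approach it in base R.

```r
# Order-insensitive key: lower-case, split into words, sort, re-join.
token_sort_key <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "[[:space:]]+")
  vapply(tokens, function(tok) paste(sort(tok), collapse = " "), character(1))
}

# Space-insensitive key: lower-case with all whitespace removed.
squash_key <- function(x) gsub("[[:space:]]+", "", tolower(x))

# Different input order, same words.
token_sort_key("Street Queen 12") == token_sort_key("12 Queen Street")

# Missing space between two words.
squash_key("12 Queenstreet") == squash_key("12 Queen Street")
```

Both comparisons come out `TRUE`. The keys can also be fed into a Levenshtein comparison instead of the raw strings, so that word order or spacing differences do not eat into the spelling tolerance.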

After looking at several commercial address autocomplete widgets, this one (https://www.addy.co.nz/address-finder-fuzzy-matching) is by far the smartest for New Zealand addresses. Perhaps you can get inspiration and come up with an even better algorithm!

Upvotes: 2
