Dave Jarvis
Dave Jarvis

Reputation: 31201

Match similar zip codes

Background

To replace invalid zip codes.

Sample Data

Consider the following data set:

 Typo  | City       | ST | Zip5
-------+------------+----+------
 33967 | Fort Myers | FL | 33902
 33967 | Fort Myers | FL | 33965
 33967 | Fort Myers | FL | 33911
 33967 | Fort Myers | FL | 33901
 33967 | Fort Myers | FL | 33907
 33967 | Fort Myers | FL | 33994
 34115 |Marco Island| FL | 34145
 34115 |Marco Island| FL | 34146
 86405 |  Kingman   | FL | 86404
 86405 |  Kingman   | FL | 86406

33967 closely matches 33965, although 33907 could also be correct. (In this case, 33967 is a valid zip code, but not in our zip code database.)

34115 closely matches is 34145 (off by one digit, with a difference of 3 for that digit).

86405 closely matches both.

Sometimes digits are simply reversed (e.g,. 89 instead of 98).

Question

How would you write a SQL statement that finds the "minimum distance" between multiple numbers that have the same number of digits, returning at most one result no matter what?

Ideas

Conditions

PostgreSQL 8.3

Upvotes: 0

Views: 583

Answers (2)

ChrisLively
ChrisLively

Reputation: 88072

Redfilter answered the question that was asked, but I just wanted to clarify that the requested solution will not resolve what appears to be the real problem.

The real problem here seems to be that you have a database which was hand keyed and some numbers were transcribed giving garbage data.

The ONLY way to solve this problem is to validate the full address against a database like the USPS, MapQuest, or another provider. I know the first two have API's available for doing this.

The example I gave in a comment above was to consider a zip of 75084 and a city value of Richardson. Richardson has zip codes in the range of 75080, 81, 82, 83, and 85. The minimum number of edits will be 1. However, which one?

Another equal problem is what if the entered zip code was 75083 for Richardson. Which is a valid zipcode for that city; however, what if the address resided in 75082?

The only way to get that is to have the full address validated.

Upvotes: 2

D'Arcy Rittich
D'Arcy Rittich

Reputation: 171529

This sounds like a case for Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

It looks like PostgreSQL has it built-in:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html

Upvotes: 5

Related Questions