Reputation: 943
I have to compare addresses from two tables and get the Id if the address matches. Each table has three columns Houseno, street, state The address are not in standard format in either of the tables. There are approx. 50,000 rows, I need to scan through
At some places its Ave. Avenue Ave . Str Street, ST. Lane Ln. Place PL Cir CIRCLE. Any combination with a dot or comma or spaces ,hypen. I was thinking of combining all three What can be best way to do it in SQL or PLSQL for example
table1
HNO STR State
----- ----- -----
12 6th Ave NY
10 3rd Aven SD
12-11 Fouth St NJ
11 sixth Lane NY
A23 Main Parkway NY
A-21 124 th Str. VA
table2
id HNO STR state
-- ----- ----- -----
1 12 6 Ave. NY
13 10 3 Avenue SD
15 1121 Fouth Street NJ
33 23 9th Lane NY
24 X23 Main Cir. NY
34 A1 124th Street VA
Upvotes: 5
Views: 4142
Reputation: 791
The way we did this for one of our applications was to use a third party adddress normalization API(eg:Pitney Bowes),normalize each address(Address is a combination of Street Address,City ,State and Zip) and create a T-sql hash for that address.For the adress to compare do the same thing and compare the two hashes and if they match,we have a match
Upvotes: 1
Reputation: 13108
Oracle has a built in package UTL_Match which has an edit_distance function (based on the Levenshtein algorithm, this is a measure of how many changes you would need to make to make one string the same as another). More info about this Package / Function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.
This is not a perfect solution in that if you set the threshold high you will have a lot of manual checking to do to discard some, and if you set it too low you will miss some matches, but it may be about the best if you want a relatively simple solution.
Upvotes: 1
Reputation: 1
you can make a cursor where you do first a group by where house number and city =.
in a loop you can separate a row with instr e substr considering chr(32).
After that you can try to consider to make a confront with substring where you have a number 6 = 6th , other case street = str.
good luck!
Upvotes: 0
Reputation: 328760
There is no simple way to achieve what you want. There is a expensive software (google for "address standardization software") that can do this but rarely 100% automatic.
What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.
Upvotes: 1