Rajiv A
Rajiv A

Reputation: 943

Compare two addresses which are not in standard format

I have to compare addresses from two tables and get the Id if the address matches. Each table has three columns Houseno, street, state The address are not in standard format in either of the tables. There are approx. 50,000 rows, I need to scan through

At some places its Ave. Avenue Ave . Str Street, ST. Lane Ln. Place PL Cir CIRCLE. Any combination with a dot or comma or spaces ,hypen. I was thinking of combining all three What can be best way to do it in SQL or PLSQL for example

table1

     HNO         STR          State
     -----       -----         ----- 
      12        6th Ave         NY
      10        3rd Aven        SD
      12-11     Fouth St        NJ                         
      11        sixth Lane      NY
      A23       Main Parkway    NY
      A-21      124 th Str.     VA

table2

      id   HNO         STR          state
     --    -----       -----         ----- 
     1      12        6 Ave.         NY
     13     10        3 Avenue       SD
     15     1121      Fouth Street   NJ                         
     33     23        9th Lane       NY
     24     X23       Main Cir.      NY
     34     A1       124th Street    VA

Upvotes: 5

Views: 4142

Answers (4)

dotnetdev_2009
dotnetdev_2009

Reputation: 791

The way we did this for one of our applications was to use a third party adddress normalization API(eg:Pitney Bowes),normalize each address(Address is a combination of Street Address,City ,State and Zip) and create a T-sql hash for that address.For the adress to compare do the same thing and compare the two hashes and if they match,we have a match

Upvotes: 1

ChrisProsser
ChrisProsser

Reputation: 13108

Oracle has a built in package UTL_Match which has an edit_distance function (based on the Levenshtein algorithm, this is a measure of how many changes you would need to make to make one string the same as another). More info about this Package / Function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm

You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.

Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.

This is not a perfect solution in that if you set the threshold high you will have a lot of manual checking to do to discard some, and if you set it too low you will miss some matches, but it may be about the best if you want a relatively simple solution.

Upvotes: 1

patrick
patrick

Reputation: 1

you can make a cursor where you do first a group by where house number and city =.

in a loop you can separate a row with instr e substr considering chr(32).

After that you can try to consider to make a confront with substring where you have a number 6 = 6th , other case street = str.

good luck!

Upvotes: 0

Aaron Digulla
Aaron Digulla

Reputation: 328760

There is no simple way to achieve what you want. There is a expensive software (google for "address standardization software") that can do this but rarely 100% automatic.

What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).

For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.

Upvotes: 1

Related Questions