Reputation: 1
I have two tables of UK postal addresses (around 300,000 rows each) and need to match one set to the other in order to return a unique ID contained in the first set for each address. The problem is there's a lot of variation in the formats of the addresses and in the spellings. I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.), but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions! I've looked at the Levenshtein distance function and at word-by-word ranking, but these methods are unreliable and problematic too.
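For illustration, this is roughly the kind of exact-match pass I mean (the table and column names here are made up, not my real schema):

    -- Rough sketch of the "easy" exact-match pass. Table and column names
    -- are examples only.
    UPDATE  t
    SET     t.MatchedID = s.AddressID
    FROM    dbo.TargetAddresses AS t
    JOIN    dbo.SourceAddresses AS s
        ON  REPLACE(UPPER(s.Postcode), ' ', '') = REPLACE(UPPER(t.Postcode), ' ', '')
        AND s.HouseNumber = t.HouseNumber
        AND UPPER(LTRIM(RTRIM(s.StreetName))) = UPPER(LTRIM(RTRIM(t.StreetName)))
    WHERE   t.MatchedID IS NULL;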
Does anyone have experience of doing similar work? What was your approach and success rate?
Thank you!
Upvotes: 0
Views: 264
Reputation: 2281
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always as consistent as we'd hoped, different editions turned up in odd forms with a wide variety of variations, and all of them had to be linked.
What I did in the end was build a fuzzy matcher. I broke each item down into components and normalised the data where I could - removing spaces from fields that didn't always have them and could live without them, for example. I worked out the edit distance between near misses - "bar" and "car" being 1 apart, for instance. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. I think I even played with SQL Server's SOUNDEX matching.
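As a toy illustration of the normalise-and-compare idea using SQL Server's built-in SOUNDEX and DIFFERENCE functions (the address strings here are invented):

    -- DIFFERENCE returns 0-4, where 4 means the two SOUNDEX codes match.
    SELECT  SOUNDEX('Grosvenor Rd')                          AS SoundexA,
            SOUNDEX('Grosvener Road')                        AS SoundexB,
            DIFFERENCE('Grosvenor Rd', 'Grosvener Road')     AS PhoneticCloseness,
            REPLACE(UPPER('sw1a 1aa'), ' ', '')              AS NormalisedPostcode;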
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
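If I were doing that candidate list in T-SQL today, it might look roughly like this - the schema, the weights and the threshold of 50 are all made up for the sake of the example:

    -- Score each pair, keep those above a threshold, and rank them per source
    -- address so the reviewer sees the best guess first.
    ;WITH Scored AS (
        SELECT  s.AddressID,
                t.AddressKey,
                t.FullAddress,
                CASE WHEN REPLACE(UPPER(s.Postcode), ' ', '')
                        = REPLACE(UPPER(t.Postcode), ' ', '') THEN 40 ELSE 0 END
              + CASE WHEN s.HouseNumber = t.HouseNumber THEN 30 ELSE 0 END
              + DIFFERENCE(s.StreetName, t.StreetName) * 5
              + DIFFERENCE(s.Town, t.Town) * 2            AS MatchScore
        FROM    dbo.SourceAddresses AS s
        JOIN    dbo.TargetAddresses AS t
            ON  LEFT(REPLACE(UPPER(s.Postcode), ' ', ''), 4)
              = LEFT(REPLACE(UPPER(t.Postcode), ' ', ''), 4)   -- coarse blocking to keep the join manageable
    )
    SELECT  AddressID,
            AddressKey,
            FullAddress,
            MatchScore,
            ROW_NUMBER() OVER (PARTITION BY AddressID
                               ORDER BY MatchScore DESC) AS CandidateRank
    FROM    Scored
    WHERE   MatchScore >= 50      -- only show plausible candidates to the reviewer
    ORDER BY AddressID, MatchScore DESC;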
At the start everyone thought the job was far too huge to be manageable. Then they started going through it and found it was much faster than they'd expected, and much easier than they'd feared to stay on top of the new data as it came in.
A script to do it all programmatically will never be perfect, and will end up nearly as long as the source list, with as many exceptions as the data can generate. Don't try to automate it perfectly; automate the easy stuff and put a human in the loop for the uncertain cases. Much easier and safer.
Upvotes: 1