Terence Chow
Terence Chow

Reputation: 11163

comparing similar words and phrases

I have 2 databases in excel. In database A, I have the names of various companies, cities, and charities. Database B is the same. However Database B is filled out by the customer. As such, I get a lot of random mistakes and/or variations on the legal name.

What is the best way to match the names?

Here are some sample differences:

  1. City of ABC might show up as Corporation of the City of ABC
  2. ABC Corporation might just up as ABCcorporation (They forgot a space)
  3. University of ABC may be abbreviated as Univ of ABC
  4. Canadian Tire might show up as Canadian Tire Store #503
  5. Canadian Tire might be spelt wrong like Canadia Tire
  6. ABC Corp might show up as ABC Inc

Is there a good solution to this? I know this question is a bit of a long shot, but if I can do this I will have saved people in my company like thousands of hours each year...

Any advice will be greatly appreciated

Upvotes: 0

Views: 350

Answers (2)

Miserable Variable
Miserable Variable

Reputation: 28761

This is a very complex problem. Look up "master data management" and "dedup". This wikipedia article is a good starting point.

The problem is best solved in small chunks. My recommendation is to read up a little and implement a tool that lists potential duplicates and some easy way to merge them. The keyword here is potential; you don't want to do wrong merges and false positives are very likely and very harmful.

Upvotes: 2

reign_man
reign_man

Reputation: 569

You could use regular expressions to filter these databases.

http://en.wikipedia.org/wiki/Regular_expression

http://www.zytrax.com/tech/web/regex.htm

You can have a program pattern match based on the relevant part of a company name. For example, If someone puts in Microsoft Corporation of Redmond, and your program pattern matches against 'Microsoft' you'd get a hit.

Upvotes: 0

Related Questions