Reputation: 53
I have searched the questions and there are similar questions asked but no solution that I think I can use. This question is similar to a fuzzy match...sorta. I need help with comparing two tables. One table is a company reference table and the other is a table that get's raw company data imported into it daily. The reference table is clean and has a company id associated with every single company. The daily data that is imported does not have a company id. What I'm trying to do is have the daily data reference the 'Company Reference Table' on company_name and update the 'Company Table's column company_state based on the company_name. Unfortunately, the daily data string for company_name coming in is not always the same each day. There can be various characters (a-z, 0-9, +, -, .) and spaces in front or after the actual company name with varying lengths daily so I do not believe I can use charindex to clean it up.
Company Reference Table
company_id company_name company_state
1 Awesome Inc NY
2 Excel-guru AL
3 Clean All MI
Company Table
company_name company_state
abc123 Awesome Inc NULL
Excel gur xyz-987 NULL
Clean All Cleanall NULL
What I want it to do is this. Sorta like a fuzzy match.
Company Table
company_name company_state
abc123 Awesome Inc NY
Excel gur xyz-987 AL
Clean All Cleanall MI
Any help is much appreciated. Thank you.
Upvotes: 5
Views: 7265
Reputation: 30628
Because the incoming data is not in a consistent format, I don't think you will be able to do this with a database. In fact, I would suggest NOT doing it with a database, allowing you to run a matching routine beforehand.
You'll then need to examine as much of the data as possible, and see if you can find any patterns, or things which you can do to the data in bulk to make it easier to match. For example:
I would then suggest something similar to the following:
I would expect that for a while, you should probably flag up low confidence matches, having a human review them, while you tune your process.
You can also store all previous matches for a company, meaning that over time your system might get better. It depends on how much the data varies each day.
Upvotes: 0
Reputation: 1219
Try below Query to update company table:
update company c INNER JOIN company_ref cr
ON c.company_name LIKE concat('%', cr.company_name, '%')
SET c.company_state = cr.company_state;
Another way just by using SELECT
SELECT c.*, cr.* FROM company c INNER JOIN company_ref cr
ON c.company_name LIKE concat('%', cr.company_name, '%');
SQL Fiddle: http://sqlfiddle.com/#!2/ec76f/1
Upvotes: 3
Reputation: 9953
If I understand, the company_name
in the company table always contains the entire string that is in the reference table - it just might contain some junk before or after that string. If so, you just need to find an appropriate string function for your DBMS that lets you check if string A contains string B. For example, with MySQL I think the following will work (not tested):
select c.company_name, r.company_state from company_table c, reference_table r where locate(r.company_name, c.company_name) != 0
that works because the MySQL locate(A, B)
function returns 0 if and only if the string A doesn't occur in the string B.
Upvotes: 0