Reputation: 181
I have a web application through a user can upload an Excel file with some destination. When the file is uploaded, I read the rows and insert them in a SQL Server database.
On the SQL Server I have to do a matching of the destination with a list of destinations in a table. As the list of destinations in the database is the reference, matching must be accurate.
Here is an example of a destination from database and a destination uploaded by the user (among which must be made to match):
What is the best way to make more accurate matching?
Upvotes: 1
Views: 404
Reputation: 11
I have solved lots of problems like this. Split the database data into relevant columns (Country, Device, Brand) in a temp table. Split the user input data (excel) into relevant columns (Country, Device, Brand) before you import into the database. Then import excel data into a temp table. Then you can adjust your matching anyway you want.
Upvotes: 1
Reputation: 161
You need to define a matching algorithm. If it is by counting words that match, no matter what order they occur, here it is:
declare @t table(field varchar(200))
insert into @t values('United Kingdom - Mobile - O2')
declare @upload varchar(200) = ' United Kingdom - O2 Mobile noise'
-- Let's find matching words, no matter in what order they are!
declare @IgnoreChars varchar(50) = char(13)+char(10)+char(9)+'-.,'
select t.field,
MatchedWords = SUM(CASE WHEN m.WordFoundAt=0 THEN 0 ELSE 1 END),
TotalWords = COUNT(*)
from @t t
CROSS APPLY dbo.str_split(dbo.str_translate(@upload, @IgnoreChars, REPLICATE(' ', LEN(@IgnoreChars))), ' ') w
OUTER APPLY (SELECT WordFoundAt = CHARINDEX(w.id, t.field)) m
where w.id <> ''
group by t.field
Result:
field MatchedWords TotalWords
United Kingdom - Mobile - O2 4 5
Functions str_translate and str_split are not built-in, but I don't know how to post them here since attachments are not allowed.
Upvotes: 0
Reputation: 11
I don't think this problem can be solved using T-SQL only. Unfortunately T-SQL has no good algorithms for fuzzy matching. Soundex is not very relevant, full text search neither for this problem.
I would recommend a very good library written in C# http://anastasiosyal.com/post/2009/01/11/Beyond-SoundEx-Functions-for-Fuzzy-Searching-in-MS-SQL-Server. It implements a lot of string metric algorithms like and can be imported as CLR functions in SQL Server. Can have performance issues for a large amount of data.
I also can recommend, especially because you import data, to create a SSIS package. In a package you can use Fuzzy Lookup Transformation block to identify similarities: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx. I use it to identify duplicates, based on similarity, in a table with more than 1 million records. Also in both cases you will have to run some tests in order to define the percent of similarity for an accurate matching in case of your business.
Upvotes: 1