Dragos
Dragos

Reputation: 181

t-sql matching words when insert with data on a table

I have a web application through a user can upload an Excel file with some destination. When the file is uploaded, I read the rows and insert them in a SQL Server database.
On the SQL Server I have to do a matching of the destination with a list of destinations in a table. As the list of destinations in the database is the reference, matching must be accurate.

Here is an example of a destination from database and a destination uploaded by the user (among which must be made to match):

What is the best way to make more accurate matching?

Upvotes: 1

Views: 404

Answers (3)

Ruwan Pathiranage
Ruwan Pathiranage

Reputation: 11

I have solved lots of problems like this. Split the database data into relevant columns (Country, Device, Brand) in a temp table. Split the user input data (excel) into relevant columns (Country, Device, Brand) before you import into the database. Then import excel data into a temp table. Then you can adjust your matching anyway you want.

Upvotes: 1

DataMaster
DataMaster

Reputation: 161

You need to define a matching algorithm. If it is by counting words that match, no matter what order they occur, here it is:

declare @t table(field varchar(200))
insert into @t values('United Kingdom - Mobile - O2')
declare @upload varchar(200) = ' United   Kingdom  -  O2    Mobile noise'

-- Let's find matching words, no matter in what order they are!
declare @IgnoreChars varchar(50) = char(13)+char(10)+char(9)+'-.,'
select t.field,
    MatchedWords = SUM(CASE WHEN m.WordFoundAt=0 THEN 0 ELSE 1 END),
    TotalWords = COUNT(*)
from @t t
    CROSS APPLY dbo.str_split(dbo.str_translate(@upload, @IgnoreChars, REPLICATE(' ', LEN(@IgnoreChars))), ' ') w
    OUTER APPLY (SELECT WordFoundAt = CHARINDEX(w.id, t.field)) m
where w.id <> ''
group by t.field

Result:

field MatchedWords TotalWords

United Kingdom - Mobile - O2 4 5

Functions str_translate and str_split are not built-in, but I don't know how to post them here since attachments are not allowed.

Upvotes: 0

Flaviu Boldea
Flaviu Boldea

Reputation: 11

I don't think this problem can be solved using T-SQL only. Unfortunately T-SQL has no good algorithms for fuzzy matching. Soundex is not very relevant, full text search neither for this problem.

I would recommend a very good library written in C# http://anastasiosyal.com/post/2009/01/11/Beyond-SoundEx-Functions-for-Fuzzy-Searching-in-MS-SQL-Server. It implements a lot of string metric algorithms like and can be imported as CLR functions in SQL Server. Can have performance issues for a large amount of data.

I also can recommend, especially because you import data, to create a SSIS package. In a package you can use Fuzzy Lookup Transformation block to identify similarities: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx. I use it to identify duplicates, based on similarity, in a table with more than 1 million records. Also in both cases you will have to run some tests in order to define the percent of similarity for an accurate matching in case of your business.

Upvotes: 1

Related Questions