Reputation: 23901
Table of People (name, dob, ssn etc)
Table of NewRecords (name, dob, ssn)
I want to write a query that determines which NewRecords do not in any way match any of the People rows (it's an update query that sets a flag in the NewRecords table).
Specifically, I want to find the NewRecords for which the Levenshtein distance between the first name, last name, and SSN is greater than 2 for all records in People (i.e., the person has a first name, last name, and SSN that are all different from those in People, and so is likely not a match).
I added a user-defined Levenshtein function (Levenshtein distance in T-SQL) and have already added an optimization that takes an extra parameter for the maximal allowed distance. (If the calculated Levenshtein distance climbs above the maximum allowed, the function exits early.) But the query is still taking an unacceptably long time because the tables are large.
What can I do to speed things up? How do I start thinking about optimization and performance? At what point do I need to start digging into the internals of sql server?
update elr
set notmatchflag = 1
from NewRecords elr
inner join People pro
on
    dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2 and
    dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2 and
    elr.ssn is null and
    elr.dob <> pro.dob
Upvotes: 1
Views: 325
Reputation: 45096
For one, skip the rows that are already flagged, so that if you run it again it will not reprocess them.
That distance function is expensive - try to eliminate the pairs that don't have a chance first.
If the lengths differ by more than 2, then the distance cannot be <= 2.
update elr
set notmatchflag = 1
from NewRecords elr
inner join People pro
   on elr.notmatchflag = 0
  and elr.ssn is null
  and elr.dob <> pro.dob
  and dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2
  and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
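Building on the length observation above, a cheap guard can be added before the expensive calls. This is only a sketch (it assumes `last` and `first` are plain varchar columns, and note that SQL Server does not formally guarantee left-to-right short-circuit evaluation, though the optimizer will usually evaluate the cheap comparison before the scalar UDF):

```sql
-- Sketch: when the lengths differ by more than 2, the Levenshtein
-- distance is guaranteed to exceed 2, so the OR lets the optimizer
-- skip the expensive UDF call entirely for those pairs.
update elr
set notmatchflag = 1
from NewRecords elr
inner join People pro
   on elr.notmatchflag = 0
  and elr.ssn is null
  and elr.dob <> pro.dob
  and (abs(len(elr.last) - len(pro.last)) > 2
       or dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2)
  and (abs(len(elr.first) - len(pro.first)) > 2
       or dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2)
```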
Upvotes: 1
Reputation: 13141
There seems to be nothing wrong with your query (except maybe for the fact that you want to find all possible combinations of differing values, which in most scenarios will give a lot of results :)).
Anyway, the problem is your Levenshtein function - I assume it's written in T-SQL. Even optimized, it's still slow. You really should compile it to CLR (the link that you posted already contains an example) - this will be an order of magnitude faster.
Another idea I'd try with what you've got is to somehow decrease the number of Levenshtein comparisons. Maybe find other conditions, or reverse the query: find all MATCHING records, and then select what's left (it may enable you to introduce those additional conditions).
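The "reverse the query" idea could look something like the sketch below. It is not a drop-in replacement - it assumes the flag semantics from the question and negates the original pairwise condition - but `EXISTS` lets SQL Server stop probing People at the first close match instead of evaluating every pair:

```sql
-- Sketch of the reversed approach: assume everything is a non-match,
-- then clear the flag for rows that DO have a close match in People.
-- EXISTS stops scanning People at the first qualifying row.
update NewRecords set notmatchflag = 1 where ssn is null;

update elr
set notmatchflag = 0
from NewRecords elr
where elr.ssn is null
  and exists (
      select 1
      from People pro
      where dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) <= 2
         or dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) <= 2
         or elr.dob = pro.dob
      )
```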
But Levenshtein compiled to CLR is your best option.
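For reference, a compiled CLR scalar function is wired up roughly like this. The assembly path, class name, and method name below are placeholders for whatever your C# build produces, and CLR integration must first be enabled on the instance (`sp_configure 'clr enabled', 1`):

```sql
-- Hypothetical names: LevenshteinClr.dll, the UserDefinedFunctions
-- class, and the Levenshtein method stand in for your C# build.
CREATE ASSEMBLY LevenshteinClr
    FROM 'C:\clr\LevenshteinClr.dll'
    WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.Levenshtein_CLR
    (@s NVARCHAR(4000), @t NVARCHAR(4000), @max INT)
RETURNS INT
AS EXTERNAL NAME LevenshteinClr.UserDefinedFunctions.Levenshtein;
GO
```

Once created, it is called exactly like the T-SQL version, so the query itself does not need to change.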
Upvotes: 1
Reputation: 5869
Since I don't know the exact table structure and types of data, I'm not 100% sure this would work, but give it a go anyway!
I would first check the SQL execution plan when you test the query; usually there will be some sections that take most of the time. From there you should be able to gauge where/if any indexes would help.
My gut feeling, though, is that it's your function being called A LOT by the looks of things, but hopefully the execution plan will determine if that is the case. If it is, then a CLR function might be the way to go.
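Alongside the graphical plan, SQL Server's built-in statistics output gives concrete numbers. Run the update (or a SELECT with the same join, to avoid modifying data while measuring) between these statements:

```sql
-- Report per-statement CPU/elapsed time and logical/physical reads
-- for whatever runs next; useful for confirming where the time goes.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- ... run the update (or an equivalent SELECT) here ...

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```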
Upvotes: 3