Reputation:
I am trying to write a query to find all USERS in a USERS table that are similar. Its like finding all non distinct rows but using a LIKE statement not an equals statement. Below is an example column of a USERS table
USERNAME
------------
tim.smith
doug.funny
tim.smith1
dan.snyder
tim.smith20
doug.funny2
emily.hunt
after query the output should look like this.
tim.smith
tim.smith1
tim.smith20
doug.funny
doug.funny2
Upvotes: 2
Views: 1407
Reputation: 5078
SSIS has functions in it to handle this for you. Run the table through SSIS and it will spid out the tables and the matches along with how close they match.
Upvotes: 0
Reputation: 59205
I have been looking for a legitimate reason to use a cross join
Select Distinct u1.UserID
from username u1 Cross join username u2
where u1.UserID <> u2.UserID
and
(PatIndex('%' + u1.UserID + '%', u2.UserID) <> 0
OR
PatIndex('%' + u2.UserID + '%', u1.UserID) <> 0)
order by u1.UserID
doug.funny
doug.funny2
tim.smith
tim.smith1
tim.smith20
Upvotes: 1
Reputation: 81950
Or you could write a function computing the levenshtein distance (http://de.wikipedia.org/wiki/Levenshtein-Distanz) and then join the table using an expression like
levenshtein(a.username, b.username) <=2
you might want to remove any special characters from the names before that, if you considere such characters 'garbage'
Upvotes: 0
Reputation: 8120
Fancy UDF based on the Levenshtein distance:
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
@cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
(Cribbed from SQLTeam.com forums.)
Unfortunately, this only works when provided two strings, and in O(nm) time, where n and m are the length of the two strings. However, you could improve upon this by modifying the function to cut out as a "NO MATCH" when @c is greater than, say, 3.
But then you'd still need a cursor to walk through each user name in the table and compare it to every other user. Not very efficient, that's for sure.
Upvotes: 0
Reputation: 6979
It would be helpful if you define exactly what you mean by "similar". Do the entries always follow a certain pattern, such as "letters, period, letters, optional numbers"?
If you're looking for typos or possible misspellings, you can try fuzzy string matching algorithms, such as soundex or Levenshtein edit distance.
Upvotes: 0
Reputation: 60508
If your data is exactly as you describe, you could just trim the numbers off the end before ordering them, but I suspect your actual data is more complicated than that.
You might want to look at the sql server SOUNDEX and DIFFERENCE functions. It may not be exactly what you need, but it would probably get you close
Upvotes: 1