DanSnyder
DanSnyder

Reputation:

Select similar like rows within the same table

I am trying to write a query to find all USERS in a USERS table that are similar. Its like finding all non distinct rows but using a LIKE statement not an equals statement. Below is an example column of a USERS table

 USERNAME 
------------
tim.smith
doug.funny
tim.smith1
dan.snyder
tim.smith20
doug.funny2
emily.hunt

after query the output should look like this.

tim.smith
tim.smith1
tim.smith20
doug.funny
doug.funny2

Upvotes: 2

Views: 1407

Answers (7)

mrdenny
mrdenny

Reputation: 5078

SSIS has functions in it to handle this for you. Run the table through SSIS and it will spid out the tables and the matches along with how close they match.

Upvotes: 0

cmsjr
cmsjr

Reputation: 59205

I have been looking for a legitimate reason to use a cross join

Select Distinct u1.UserID 
from username  u1 Cross join username u2
where u1.UserID <> u2.UserID
and 
(PatIndex('%' + u1.UserID  + '%', u2.UserID)  <> 0
 OR
PatIndex('%' + u2.UserID  + '%', u1.UserID)  <> 0)
order by u1.UserID

doug.funny
doug.funny2
tim.smith
tim.smith1
tim.smith20

Upvotes: 1

Jens Schauder
Jens Schauder

Reputation: 81950

Or you could write a function computing the levenshtein distance (http://de.wikipedia.org/wiki/Levenshtein-Distanz) and then join the table using an expression like

levenshtein(a.username, b.username) <=2

you might want to remove any special characters from the names before that, if you considere such characters 'garbage'

Upvotes: 0

Kyle Hale
Kyle Hale

Reputation: 8120

Fancy UDF based on the Levenshtein distance:

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
    @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len
  BEGIN
    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN @c
END

(Cribbed from SQLTeam.com forums.)

Unfortunately, this only works when provided two strings, and in O(nm) time, where n and m are the length of the two strings. However, you could improve upon this by modifying the function to cut out as a "NO MATCH" when @c is greater than, say, 3.

But then you'd still need a cursor to walk through each user name in the table and compare it to every other user. Not very efficient, that's for sure.

Upvotes: 0

weiyin
weiyin

Reputation: 6979

It would be helpful if you define exactly what you mean by "similar". Do the entries always follow a certain pattern, such as "letters, period, letters, optional numbers"?

If you're looking for typos or possible misspellings, you can try fuzzy string matching algorithms, such as soundex or Levenshtein edit distance.

Upvotes: 0

Jared
Jared

Reputation: 39883

Couldn't you just do

select * from USERTABLE order by USERNAME;

Upvotes: 0

Eric Petroelje
Eric Petroelje

Reputation: 60508

If your data is exactly as you describe, you could just trim the numbers off the end before ordering them, but I suspect your actual data is more complicated than that.

You might want to look at the sql server SOUNDEX and DIFFERENCE functions. It may not be exactly what you need, but it would probably get you close

Upvotes: 1

Related Questions