How to find similar results in SQL Server?

Question

I have these items in my DB:

Harry Potter and the Chamber of Secrets
Harry Potter and the Deathly Hallows: Part 1
Harry Potter and the Deathly Hallows: Part 2
Harry Potter and the Goblet of Fire
Harry Potter and the Half-Blood Prince
Harry Potter and the Order of the Phoenix
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Sorcerer's Stone
And other movies...

How can i return above items, when user search for 'hary poter' (not 'harry potter')?

ZLK · Accepted Answer

It's difficult to get something that works really well for this sort of thing in SQL Server. Fuzzy matches are really hard to work with when you need to search for spelling mistakes while trying not to get bad matches on things.

For example, the following is one way you could try to do this:

DECLARE @ TABLE (id INT IDENTITY(1, 1), blah NVARCHAR(255));

INSERT @ VALUES ('Harry Potter and the Chamber of Secrets')
,('Harry Potter and the Deathly Hallows: Part 1')
,('Harry Potter and the Deathly Hallows: Part 2')
,('Harry Potter and the Goblet of Fire')
,('Harry Potter and the Half-Blood Prince')
,('Harry Potter and the Order of the Phoenix')
,('Harry Potter and the Prisoner of Azkaban')
,('Harry Potter and the Sorcerer''s Stone');

DECLARE @myVar NVARCHAR(255) = 'deadly halow'; -- returns 2 matches (both parts of Deathly Hallows)
-- SET @myVar = 'hary poter'; -- returns 8 matches, all of them
-- SET @myVar = 'order'; -- returns 1 match (Order of the Phoenix)
-- SET @myVar = 'phoneix'; -- returns 2 matches (Order of the Phoenix and Half-blood Prince, the latter due to a fuzzy match on 'prince')

WITH CTE AS (
    SELECT id, blah
    FROM @
    UNION ALL
    SELECT 0, @myVar
    )
, CTE2 AS (
    SELECT id
         , blah
         , SUBSTRING(blah, 1, ISNULL(NULLIF(CHARINDEX(' ', blah), 0) - 1, LEN(blah))) individualWord
         , NULLIF(CHARINDEX(' ', blah), 0) cIndex
         , 1 L
    FROM CTE
    UNION ALL 
    SELECT CTE.id
         , CTE.blah
         , SUBSTRING(CTE.blah, cIndex + 1, ISNULL(NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0) - 1 - cIndex, LEN(CTE.blah)))
         , NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0)
         , L + 1
    FROM CTE2
    JOIN CTE ON CTE.id = CTE2.id
    WHERE cIndex IS NOT NULL
    )
SELECT blah
FROM (
    SELECT X.blah, ROW_NUMBER() OVER (PARTITION BY X.ID, Y.L ORDER BY (SELECT NULL)) RN, Y.wordCount
    FROM CTE2 X
    JOIN (SELECT *, COUNT(*) OVER() wordCount FROM CTE2 WHERE id = 0) Y ON DIFFERENCE(X.individualWord, Y.individualWord) >= 3 AND X.id <> 0) T
WHERE RN = 1
GROUP BY blah
HAVING COUNT(*) = MAX(wordCount);

This splits each of the words in the search term, splits each of the words in the titles, then uses the DIFFERENCE() function, which compares the SOUNDEX() of the values and tells you how far apart they are. e.g. SOUNDEX('Halow') is 'H400' and SOUNDEX('Hallows') is 'H420' - the difference here is 3 (because H, 4 and one of the zeroes match). A perfect match would have a difference of 4, a close match has a difference above 3 generally.

Unfortunately, because you need to check for close matches, you get some false positives with this sometimes. I tested it with, for example, 'phoneix' as the input and got a match on 'Half-blood Prince' due to a fuzzy match between 'prince' and 'phoenix'. I'm sure there are ways this could be improved upon, but something like this should work as a basis for what you're trying to achieve.

How to find similar results in SQL Server?

Answers (2)

Related Questions