Reputation: 67
I have these items in my DB:
How can i return above items, when user search for 'hary poter' (not 'harry potter')?
Upvotes: 0
Views: 381
Reputation: 2874
It's difficult to get something that works really well for this sort of thing in SQL Server. Fuzzy matches are really hard to work with when you need to search for spelling mistakes while trying not to get bad matches on things.
For example, the following is one way you could try to do this:
DECLARE @ TABLE (id INT IDENTITY(1, 1), blah NVARCHAR(255));
INSERT @ VALUES ('Harry Potter and the Chamber of Secrets')
,('Harry Potter and the Deathly Hallows: Part 1')
,('Harry Potter and the Deathly Hallows: Part 2')
,('Harry Potter and the Goblet of Fire')
,('Harry Potter and the Half-Blood Prince')
,('Harry Potter and the Order of the Phoenix')
,('Harry Potter and the Prisoner of Azkaban')
,('Harry Potter and the Sorcerer''s Stone');
DECLARE @myVar NVARCHAR(255) = 'deadly halow'; -- returns 2 matches (both parts of Deathly Hallows)
-- SET @myVar = 'hary poter'; -- returns 8 matches, all of them
-- SET @myVar = 'order'; -- returns 1 match (Order of the Phoenix)
-- SET @myVar = 'phoneix'; -- returns 2 matches (Order of the Phoenix and Half-blood Prince, the latter due to a fuzzy match on 'prince')
WITH CTE AS (
SELECT id, blah
FROM @
UNION ALL
SELECT 0, @myVar
)
, CTE2 AS (
SELECT id
, blah
, SUBSTRING(blah, 1, ISNULL(NULLIF(CHARINDEX(' ', blah), 0) - 1, LEN(blah))) individualWord
, NULLIF(CHARINDEX(' ', blah), 0) cIndex
, 1 L
FROM CTE
UNION ALL
SELECT CTE.id
, CTE.blah
, SUBSTRING(CTE.blah, cIndex + 1, ISNULL(NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0) - 1 - cIndex, LEN(CTE.blah)))
, NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0)
, L + 1
FROM CTE2
JOIN CTE ON CTE.id = CTE2.id
WHERE cIndex IS NOT NULL
)
SELECT blah
FROM (
SELECT X.blah, ROW_NUMBER() OVER (PARTITION BY X.ID, Y.L ORDER BY (SELECT NULL)) RN, Y.wordCount
FROM CTE2 X
JOIN (SELECT *, COUNT(*) OVER() wordCount FROM CTE2 WHERE id = 0) Y ON DIFFERENCE(X.individualWord, Y.individualWord) >= 3 AND X.id <> 0) T
WHERE RN = 1
GROUP BY blah
HAVING COUNT(*) = MAX(wordCount);
This splits each of the words in the search term, splits each of the words in the titles, then uses the DIFFERENCE()
function, which compares the SOUNDEX()
of the values and tells you how far apart they are. e.g. SOUNDEX('Halow')
is 'H400' and SOUNDEX('Hallows')
is 'H420' - the difference here is 3 (because H, 4 and one of the zeroes match). A perfect match would have a difference of 4, a close match has a difference above 3 generally.
Unfortunately, because you need to check for close matches, you get some false positives with this sometimes. I tested it with, for example, 'phoneix' as the input and got a match on 'Half-blood Prince' due to a fuzzy match between 'prince' and 'phoenix'. I'm sure there are ways this could be improved upon, but something like this should work as a basis for what you're trying to achieve.
Upvotes: 1
Reputation: 13959
You can use this query
create table #test (v varchar(50) )
insert into #test (v) values
('Harry Potter and the Chamber of Secrets' )
,('Harry Potter and the Deathly Hallows: Part 1' )
,('Harry Potter and the Deathly Hallows: Part 2' )
,('Harry Potter and the Goblet of Fire' )
,('Harry Potter and the Half-Blood Prince' )
,('Harry Potter and the Order of the Phoenix' )
,('Harry Potter and the Prisoner of Azkaban' )
,('Harry Potter and the Sorcerer''s Stone' )
select * from #test
where PATINDEX('%[Hh]%ar[r]%y [pP]%ot[t]%er%', v)>0
Upvotes: 0