Reputation: 727
I have a database with fields that can contain long phrases of words. I wanted the ability to quickly search for a keyword or phrase in these columns, but when searching a phrase, I want to be able to search the phrase like Google would, returning all rows that contain all of the specified words, but in no particular order or "nearness" to each other. Ranking the results by relevance is unnecessary at this point.
After reading about SQL Server's Full-Text Search, I thought it would be just what I needed: a searchable index based on each word in a text-based column. My end goal is to safely accept user input and turn it into a query that leverages the speed of Full-Text Search, while maintaining ease-of-use for the users.
I see the FREETEXT
function can take an entire phrase, break it up into "useful" words (ignoring words like 'and', 'or', 'the', etc), and then return a list of matching rows very quickly, even with a complex search term. But when you try to use it, you may notice that instead of an AND
search for each of the terms, it seems to only do an OR
search. Maybe there's a way to change its behavior, but I haven't found anything useful.
Then there's CONTAINS
, which can accept a boolean query phrase, but sometimes with odd results.
Take a look at the following queries on this table:
PKID Name
----- -----
1 James Kirk
2 James Cameron
3 Kirk Cameron
4 Kirk For Cameron
Q1: SELECT Name FROM tblName WHERE FREETEXT(Name, 'james')
Q2: SELECT Name FROM tblName WHERE FREETEXT(Name, 'james kirk')
Q3: SELECT Name FROM tblName WHERE FREETEXT(Name, 'kirk for cameron')
Q4: SELECT Name FROM tblName WHERE CONTAINS(Name, 'james')
Q5: SELECT Name FROM tblName WHERE CONTAINS(Name, '"james kirk"')
Q6: SELECT Name FROM tblName WHERE CONTAINS(Name, '"kirk james"')
Q7: SELECT Name FROM tblName WHERE CONTAINS(Name, 'james AND kirk')
Q8: SELECT Name FROM tblName WHERE CONTAINS(Name, 'kirk AND for AND cameron')
SELECT Name FROM tblName WHERE FREETEXT(Name, 'james')
Returns "James Kirk" and "James Cameron". Alright, lets narrow it down...
SELECT Name FROM tblName WHERE FREETEXT(Name, 'james kirk')
Guess what. Now you'll get "James Kirk", "James Cameron", and "Kirk For Cameron". Same thing happens for Query 3, so let's just skip that.
SELECT Name FROM tblName WHERE CONTAINS(Name, 'james')
Same results as Query 1. Okay. Narrow the results maybe...?
SELECT Name FROM tblName WHERE CONTAINS(Name, '"james kirk"')
After discovering that you need to enclose the string in double-quotes if there are spaces, I find that this query works great on this particular dataset for the results I desire! Only "James Kirk" is returned. Wonderful! Or is it...
SELECT Name FROM tblName WHERE CONTAINS(Name, '"kirk james"')
Crap. No. It is matching that exact phrase. Hmmm... After checking the syntax for T-SQL's CONTAINS function, I see that you can throw boolean keywords in there, and it looks like that might be the answer. Let's see...
SELECT Name FROM tblName WHERE CONTAINS(Name, 'james AND kirk')
Neat. I get all three results, as expected. Now I just write a function to cram the word AND
between all the words. Done, right? What now...
SELECT Name FROM tblName WHERE CONTAINS(Name, 'kirk AND for AND cameron')
This query knows exactly what it's looking for, except for some reason, there are no results. Why? Well after reading about Stopwords and Stoplists, I will make an educated guess and say that because I'm asking for the intersection of the index results for "kirk", "for", and "cameron", and the word "for" will not have any results (what with it being a stopword and all), then the result of any intersection with that result is also empty. Whether or not it actually functions like that is irrelevant to me, since that is the observable behavior of the CONTAINS
function every time I do a boolean search with a stopword in there.
So I need a new solution.
NEAR
Looks promising. If I can take a user query and put commas between it, this will... wait this is the same thing as using boolean AND
in CONTAINS
queries. But does it ignore stopwords correctly?
SELECT Name FROM tblName WHERE CONTAINS(Name, 'NEAR(kirk, for, cameron)')
Nope. No results. Remove the word "for", and you get all three results again. :(
Upvotes: 14
Views: 21812
Reputation: 96
I am using ISAbout with combination of THESAURUS, INFLECTIONAL and Wildcards
Advantages are
1 - Order of words in the search string doesn't matter
2 - Search similar words (THESAURUS)
3 - Treat runs,running,ran,run as same (INFLECTIONAL)
4 - Near doesn't return result if search string has even one element which isn't in the resulting string, but ISAbout will always return the most desirable result
5 - You can set weight of different words, which will further help you to optimize correctness of your result
SELECT K.RANK, name, Description
FROM Diagnosis AS C
INNER JOIN
CONTAINSTABLE(diagnosis,name,<br> 'isAbout(FORMSOF (THESAURUS, "CHRONIC") weight(1.0),FORMSOF (INFLECTIONAL, "CHRONIC") weight(1.0),CHRONIC* weight(1.0)
FORMSOF (THESAURUS, "FAILURE") weight(1.0),FORMSOF (INFLECTIONAL, "FAILURE") weight(1.0),FAILURE* weight(1.0),
FORMSOF (THESAURUS, "DIASTOLIC") weight(1.0),FORMSOF (INFLECTIONAL, "DIASTOLIC") weight(1.0),DIASTOLIC* weight(1.0))')
AS K
ON C.ID = K.[KEY];
I am still looking for ways to optimize it.
Note : I remove stop words from my search string programmatically.
Upvotes: 2
Reputation: 727
I found another question on here that deals with this same topic. In fact, the post detailing the method is even titled "A Google-like Full Text Search". It uses an open-source library called Irony to parse a user-entered search string and turn it into a FTS-compatible query.
Here is the source code for the latest version of the Google-like Full-Text Search.
Upvotes: 3
Reputation: 727
NOTE: To run this function requires server administrator permissions, so it's probably not a viable solution for a production environment.
I have learned enough about how the Full-Text Search functions to find the sys.dm_fts_parser debugging query. It will show you how the Full-Text Search query functions will look at your input. It splits up the search terms into rows by word (according to the specified stoplist), and then assigns information to each word, such as position in the query, whether the word is on a stoplist, etc.
So, I guess I have an idea. What if I use this debugging function to do what I was doing before, but this time ignore the words on the stoplist?
Take a look at this query that works perfectly, but looks like an absolute nightmare to understand:
DECLARE @input AS varchar(1000) = 'kirk for cameron',
@query AS varchar(1000),
@inputQuery AS varchar(1000),
@tokens AS TABLE (display_term varchar(255))
INSERT INTO @tokens
SELECT display_term
FROM sys.dm_fts_parser('"' + @input + '"', 1033, 0, 0) -- double quotes are required around the input search term
WHERE special_term = 'Exact Match' -- filters out stopwords
SET @query = STUFF(
(SELECT ' AND ' + display_term
FROM @tokens
FOR XML PATH ('')), 1, 5, '')
SET @inputQuery = '"' + @input + '"'
PRINT @query -- 'kirk AND cameron'
PRINT @inputQuery -- '"kirk for cameron"'
SELECT * FROM tblName WHERE CONTAINS(Name, @inputQuery)
UNION
SELECT * FROM tblName WHERE CONTAINS(Name, @query)
AND
between all the good search terms, using a tricky implementation of FOR XML PATH
PRINT
the variables, so that I can check their values while debuggingCONTAINS
search, using the two queries I've built, including exact matches at the top of the result setTwo results are returned! A user might be confused momentarily by the additional result, but type this into Google, and I bet it'll return results the same way.
PKID Name
----- -----
4 Kirk For Cameron
3 Kirk Cameron
Upvotes: 0
Reputation: 1514
Have you looked at using the Semantic Index functions in SQL Server 2012?
They are built on full text indexes but extend them to include details about word frequency. I used them just recently to build a word cloud and it was really good.
There are some good articles to be found on the internet and you can also search for words that are 'near' each other in docs. I set up the full text index across 2 nvarchar columns and then enable sematic indexing.
These links will get you started but I think it will give you what you need.
Upvotes: 2