Reputation: 423
I have a table with a column as nvarchar(max)
with text extracted from word documents in it. How can I create a select query that I'll pass another a list of keywords as parameter and return the rows ordered by the number of matches?
Maybe it is possible with full text search?
Upvotes: 1
Views: 2219
Reputation: 5421
how to ... return the rows ordered by the number of [full-text] matches
I have not used it myself but believe SQL Server 2008 supports weighting the CONTAINSTABLE matches which might be of help to you:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
If you don't have an engine that returns results weighted by the number of hits ...
You could write a UDF that takes two inputs and returns an integer: the big textvalue is the first input and the words you're looking for as a comma-delimited string is the second. The function returns an integer representing either the number of distinct looked-for words that were actually found at least once in the text, or the total number of times the looked-for words were found. Implementation --how to weight-- is up to you. Maybe, for example, you'd want to arrange the looked-for words in most-important to least-important order, and give an important word hit more weight than a less important word hit.
You could then use your full text search engine to find all records that contain at least one of the words (you'd OR them), and you'd run this result set through your UDF scalar function:
pseudo code
select title, weightfunction(summary, 'word1,word2,word3....wordN')
from docs
where summary contains ( word1 or word2 or word3 ... or wordN)
order by weightfunction(summary, 'word1,word2,word3....wordN') desc
Upvotes: 0
Reputation:
Yes, possible with full text search, and likely the best answer. For a straight T-SQL solution, you could use a split function and join, e.g. assuming a table of numbers called dbo.Numbers (you may need to decide on a different upper limit):
SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 200000;
WITH n AS
(
SELECT
rn = ROW_NUMBER() OVER
(ORDER BY s1.[object_id])
FROM sys.objects AS s1
CROSS JOIN sys.objects AS s2
CROSS JOIN sys.objects AS s3
)
SELECT [Number] = rn - 1
INTO dbo.Numbers
FROM n
WHERE rn <= @UpperLimit + 1;
CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
And a splitting function that uses that table of numbers:
CREATE FUNCTION dbo.SplitStrings
(
@List NVARCHAR(MAX)
)
RETURNS TABLE
AS
RETURN
(
SELECT DISTINCT
[Value] = LTRIM(RTRIM(
SUBSTRING(@List, [Number],
CHARINDEX(N',', @List + N',', [Number]) - [Number])))
FROM
dbo.Numbers
WHERE
Number <= LEN(@List)
AND SUBSTRING(N',' + @List, [Number], 1) = N','
);
GO
Then you can simply say:
SELECT key, NvarcharColumn /*, other cols */
FROM dbo.table AS outerT
WHERE EXISTS
(
SELECT 1
FROM dbo.table AS t
INNER JOIN dbo.SplitStrings(N'list,of,words') AS s
ON t.NvarcharColumn LIKE '%' + s.Item + '%'
WHERE t.key = outerT.key
);
As a procedure:
CREATE PROCEDURE dbo.Search
@List NVARCHAR(MAX)
AS
BEGIN
SET NOCOUNT ON;
SELECT key, NvarcharColumn /*, other cols */
FROM dbo.table AS outerT
WHERE EXISTS
(
SELECT 1
FROM dbo.table AS t
INNER JOIN dbo.SplitStrings(@List) AS s
ON t.NvarcharColumn LIKE '%' + s.Item + '%'
WHERE t.key = outerT.key
);
END
GO
Then you can just pass in @List
(e.g. EXEC dbo.Search @List = N'foo,bar,splunge'
) from C#.
This won't be super fast, but I'm sure it will be quicker than pulling all the data out into C# and double-nested loop it manually.
Upvotes: 1