Reputation: 2118
Here's the issue I am having. I want to return rows that are similiar from my table in my database. Consider two rows with the following text:
'Cisco phones cannot dial out'
'Phones are not working for outgoing calls'
These are two different rows...I tried to do something like this:
DECLARE @TheTest varchar(1000)
DECLARE @TheResult varchar(1000)
SET @TheTest = ('Cisco phones cannot dial out')
SET @TheResult = ('Phones are not working for outgoing calls')
CREATE TABLE #Test
(
MyCol varchar(1000)
)
INSERT INTO #Test(MyCol)
SELECT @TheResult
SELECT * FROM #Test WHERE LOWER(MyCol) LIKE '%' + LOWER(@TheTest) + '%'
DROP TABLE #Test
But the result is 0 rows were returned, well I understand that because the string TheTest is not close enough to the string @TheResult...But I need a solution that would actually return this row because the word phones appear in both texts.
Would I have to build something much more elaborate to seperate words and get rid of common words for something like this? I'd like to mimic functionality that I see on a specific website:
Upvotes: 0
Views: 176
Reputation: 45096
From what I understand of your question not even SQL FullText is going to get what you want.
You are looking for Lucene type features.
I think they use Lucene on SO for finding like questions.
A short cut is to parse the words and then take the stem (Porter) and populate a table.
Just using Regex on a word break is a start.
Or you can skip the stemming but that will miss a lot of matches (e.g. match).
Only index unique words / stems.
If a words show up 4 times in one phrase and 2 times in the other it is problematic to score.
And I think a single match per word is a more meaningful score.
Then do a join and count the number of words joined.
Need to normalize as 12 matches on 12 words is better than 14 matches on 20 words.
Like 2 * matches / (word count A + word count B).
The other direction is to not stem but use a fuzzy match like a Levenstein distance.
In your sample only phone matches so that would have a low score.
But even a Lucene or Google would have trouble giving that a high score.
Breaking down the English language for two phrases that have different words but mean the same is very complex.
Upvotes: 2
Reputation: 460208
One way would be to use a Split
function(here's a demo):
CREATE FUNCTION [dbo].[Split]
(
@ItemList NVARCHAR(MAX),
@delimiter CHAR(1)
)
RETURNS @IDTable TABLE (Item VARCHAR(50))
AS
BEGIN
DECLARE @tempItemList NVARCHAR(MAX)
SET @tempItemList = @ItemList
DECLARE @i INT
DECLARE @Item NVARCHAR(4000)
SET @i = CHARINDEX(@delimiter, @tempItemList)
WHILE (LEN(@tempItemList) > 0)
BEGIN
IF @i = 0
SET @Item = @tempItemList
ELSE
SET @Item = LEFT(@tempItemList, @i - 1)
INSERT INTO @IDTable(Item) VALUES(@Item)
IF @i = 0
SET @tempItemList = ''
ELSE
SET @tempItemList = RIGHT(@tempItemList, LEN(@tempItemList) - @i)
SET @i = CHARINDEX(@delimiter, @tempItemList)
END
RETURN
END
Now you can check if one of the words is contained in the column:
DECLARE @TheTest varchar(1000)
DECLARE @TheResult varchar(1000)
SET @TheTest = ('Cisco phones cannot dial out')
SET @TheResult = ('Phones are not working for outgoing calls')
CREATE TABLE #Test
(
MyCol varchar(1000)
)
INSERT INTO #Test(MyCol)
SELECT @TheResult
Declare @searchWords Table(Item varchar(100));
INSERT INTO @searchWords
SELECT Item FROM dbo.Split(@TheTest, ' ');
SELECT * FROM #Test t
WHERE EXISTS
(
SELECT 1
FROM dbo.Split(t.MyCol, ' ')cw INNER JOIN @searchWords sw
ON cw.Item = sw.Item
);
DROP TABLE #Test
Upvotes: 0
Reputation: 8113
You might want to try SOUNDEX
to return the similarity between two strings : http://msdn.microsoft.com/en-us/library/ms187384.aspx
It returns a varchar that holds the similarity between two strings. You can then evaluate if the varchar is in an acceptable "similarity range" that you defined.
Upvotes: 0