oJM86o
oJM86o

Reputation: 2118

Cannot use LIKE so what can I use instead

Here's the issue I am having. I want to return rows that are similiar from my table in my database. Consider two rows with the following text:

'Cisco phones cannot dial out'
'Phones are not working for outgoing calls'

These are two different rows...I tried to do something like this:

DECLARE @TheTest varchar(1000)
DECLARE @TheResult varchar(1000)

SET @TheTest = ('Cisco phones cannot dial out')
SET @TheResult = ('Phones are not working for outgoing calls')

CREATE TABLE #Test
(
MyCol varchar(1000)
)

INSERT INTO #Test(MyCol)
SELECT @TheResult

SELECT * FROM #Test WHERE LOWER(MyCol) LIKE '%' + LOWER(@TheTest) + '%'

DROP TABLE #Test

But the result is 0 rows were returned, well I understand that because the string TheTest is not close enough to the string @TheResult...But I need a solution that would actually return this row because the word phones appear in both texts.

Would I have to build something much more elaborate to seperate words and get rid of common words for something like this? I'd like to mimic functionality that I see on a specific website:

enter image description here

Upvotes: 0

Views: 176

Answers (3)

paparazzo
paparazzo

Reputation: 45096

From what I understand of your question not even SQL FullText is going to get what you want.
You are looking for Lucene type features.
I think they use Lucene on SO for finding like questions.

A short cut is to parse the words and then take the stem (Porter) and populate a table.
Just using Regex on a word break is a start.
Or you can skip the stemming but that will miss a lot of matches (e.g. match).
Only index unique words / stems.
If a words show up 4 times in one phrase and 2 times in the other it is problematic to score.
And I think a single match per word is a more meaningful score.
Then do a join and count the number of words joined.
Need to normalize as 12 matches on 12 words is better than 14 matches on 20 words.
Like 2 * matches / (word count A + word count B).

The other direction is to not stem but use a fuzzy match like a Levenstein distance.

In your sample only phone matches so that would have a low score.
But even a Lucene or Google would have trouble giving that a high score.
Breaking down the English language for two phrases that have different words but mean the same is very complex.

Upvotes: 2

Tim Schmelter
Tim Schmelter

Reputation: 460208

One way would be to use a Split function(here's a demo):

CREATE FUNCTION [dbo].[Split]
(
    @ItemList NVARCHAR(MAX), 
    @delimiter CHAR(1)
)
RETURNS @IDTable TABLE (Item VARCHAR(50))  
AS      

BEGIN    
    DECLARE @tempItemList NVARCHAR(MAX)
    SET @tempItemList = @ItemList

    DECLARE @i INT    
    DECLARE @Item NVARCHAR(4000)

    SET @i = CHARINDEX(@delimiter, @tempItemList)

    WHILE (LEN(@tempItemList) > 0)
    BEGIN
        IF @i = 0
            SET @Item = @tempItemList
        ELSE
            SET @Item = LEFT(@tempItemList, @i - 1)
        INSERT INTO @IDTable(Item) VALUES(@Item)
        IF @i = 0
            SET @tempItemList = ''
        ELSE
            SET @tempItemList = RIGHT(@tempItemList, LEN(@tempItemList) - @i)
        SET @i = CHARINDEX(@delimiter, @tempItemList)
    END 
    RETURN
END  

Now you can check if one of the words is contained in the column:

DECLARE @TheTest varchar(1000)
DECLARE @TheResult varchar(1000)

SET @TheTest = ('Cisco phones cannot dial out')
SET @TheResult = ('Phones are not working for outgoing calls')

CREATE TABLE #Test
(
MyCol varchar(1000)
)

INSERT INTO #Test(MyCol)
    SELECT @TheResult

Declare  @searchWords Table(Item varchar(100));
INSERT INTO @searchWords
    SELECT Item FROM dbo.Split(@TheTest, ' ');

SELECT * FROM #Test t
WHERE EXISTS
(
    SELECT 1 
    FROM dbo.Split(t.MyCol, ' ')cw INNER JOIN @searchWords sw
        ON cw.Item = sw.Item
);

DROP TABLE #Test 

Upvotes: 0

Dominic Goulet
Dominic Goulet

Reputation: 8113

You might want to try SOUNDEX to return the similarity between two strings : http://msdn.microsoft.com/en-us/library/ms187384.aspx

It returns a varchar that holds the similarity between two strings. You can then evaluate if the varchar is in an acceptable "similarity range" that you defined.

Upvotes: 0

Related Questions