MAK
MAK

Reputation: 7260

Optimal search string in the where clause

Want to search the string using PATINDEX and SOUNDEX within the WHERE clause or any optimal way.

I have the following table with some sample data to search the given string using PATINDEX and SOUNDEX.

create table tbl_pat_soundex
(
    col_str varchar(max)
);

insert into tbl_pat_soundex values('Smith A Steve');
insert into tbl_pat_soundex values('Steve A Smyth');
insert into tbl_pat_soundex values('A Smeeth Stive');
insert into tbl_pat_soundex values('Steve Smith A');
insert into tbl_pat_soundex values('Smit Steve A');

Note: I have 100 Millions of records in the table to search for.

String to search:- 'Smith A Steve'

SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND PATINDEX('%A%',col_str) >= 1 AND PATINDEX('%Steve%',col_str) >= 1

Getting Output:

col_str
--------------
Smith A Steve
Steve Smith A

Expected Output:

col_str         
----------------
Smith A Steve   
Steve A Smyth   
A Smeeth Stive  
Steve Smith A   
Smit Steve A    

Tried:

1:

SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND 
      PATINDEX('%A%',col_str) >= 1 AND 
      PATINDEX('%Steve%',col_str) >= 1

2:

SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%'+SOUNDEX('Smith')+'%',SOUNDEX(col_str)) >= 1 AND 
      PATINDEX('%'+SOUNDEX('A')+'%',SOUNDEX(col_str)) >= 1 AND 
      PATINDEX('%'+SOUNDEX('Steve')+'%',SOUNDEX(col_str)) >= 1

3:

SELECT col_str
FROM tbl_pat_soundex    
WHERE DIFFERENCE('Smith',col_str) = 4 AND 
      DIFFERENCE('A',col_str) =4 AND 
      DIFFERENCE('Steve',col_str) = 4

4:

--Following was taking huge time(was kept running more than 20 minutes) to execute.
SELECT DISTINCT col_str
FROM tbl_pat_soundex [a]
CROSS APPLY SplitString([a].[col_str], ' ') [b]
WHERE DIFFERENCE([b].Item,'Smith') >= 1 AND 
      DIFFERENCE([b].Item,'A') >= 1 AND 
      DIFFERENCE([b].Item,'Steve') >= 1

Upvotes: 2

Views: 924

Answers (2)

Gottfried Lesigang
Gottfried Lesigang

Reputation: 67311

With such a lot of rows the only hint I can give you is: Change the design. Each name part should live in a separate column...

The following will work, but I promise it will be slow...

--set up a test db

USE master;
GO
CREATE DATABASE shnugo;
GO
USE shnugo;
GO

--your table, I added an ID-column

create table tbl_pat_soundex
(
    ID INT IDENTITY --needed to distinguish rows
   ,col_str varchar(max)
);
GO

--A function, which will return a blank-separated string as a alphabetically sorted list of distinct soundex values separated by /: "Smith A Steve" comes back as /A000/S310/S530/

CREATE FUNCTION dbo.ComputeSoundex(@str VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @tmpXML XML=CAST('<x>' + REPLACE((SELECT @str AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML);
    RETURN (SELECT DISTINCT '/' + SOUNDEX(x.value('text()[1]','varchar(max)')) AS [se]
            FROM @tmpXML.nodes('/x[text()]') A(x)
            ORDER BY se
            FOR XML PATH(''),TYPE).value('.','nvarchar(max)') + '/';
END
GO

--Add a column to store a computed soundex-chain permanently

ALTER TABLE tbl_pat_soundex ADD SortedSoundExPattern VARCHAR(MAX);
GO

--We need a trigger to maintain the computed soundex-chain on any insert or update

CREATE TRIGGER RefreshComputeSoundex ON tbl_pat_soundex
FOR INSERT,UPDATE
AS
BEGIN
    UPDATE s SET SortedSoundExPattern=dbo.ComputeSoundex(i.col_str)
    FROM tbl_pat_soundex s
    INNER JOIN inserted i ON s.ID=i.ID;
END
GO

--test data

insert into tbl_pat_soundex(col_str) values
 ('Smith A Steve')
,('Steve A Smyth')
,('A Smeeth Stive')
,('Steve Smith A')
,('Smit Steve A')
,('Smit Steve') --no A
,('Smit A') --no Steve
,('Smit Smith Robert Peter A') --add noise
,('Shnugo'); --something else entirely

--check the intermediate result

SELECT * 
FROM tbl_pat_soundex

/*
+----+---------------------------+-----------------------+
| ID | col_str                   | SortedSoundExPattern  |
+----+---------------------------+-----------------------+
| 1  | Smith A Steve             | /A000/S310/S530/      |
+----+---------------------------+-----------------------+
| 2  | Steve A Smyth             | /A000/S310/S530/      |
+----+---------------------------+-----------------------+
| 3  | A Smeeth Stive            | /A000/S310/S530/      |
+----+---------------------------+-----------------------+
| 4  | Steve Smith A             | /A000/S310/S530/      |
+----+---------------------------+-----------------------+
| 5  | Smit Steve A              | /A000/S310/S530/      |
+----+---------------------------+-----------------------+
| 6  | Smit Steve                | /S310/S530/           |
+----+---------------------------+-----------------------+
| 7  | Smit A                    | /A000/S530/           |
+----+---------------------------+-----------------------+
| 8  | Smit Smith Robert Peter A | /A000/P360/R163/S530/ |
+----+---------------------------+-----------------------+
| 9  | Shnugo                    | /S520/                |
+----+---------------------------+-----------------------+
*/

--Now we can start to search:

DECLARE @StringToSearch VARCHAR(MAX)=' A Steve';

WITH SplittedSearchString AS
(
    SELECT soundexCode.value('text()[1]','nvarchar(max)') AS SoundExCode
    FROM (SELECT CAST('<x>' + REPLACE(dbo.ComputeSoundex(@StringToSearch),'/','</x><x>') + '</x>' AS XML)) A(x)
    CROSS APPLY x.nodes('/x[text()]') B(soundexCode)
)
SELECT a.ID,col_str
FROM tbl_pat_soundex a
INNER JOIN SplittedSearchString s On SortedSoundExPattern LIKE '%/' +  s.SoundExCode + '/%'
GROUP BY ID,col_str
HAVING COUNT(ID)=(SELECT COUNT(*) FROM SplittedSearchString)
ORDER BY ID 
GO

--clean-up

USE master;
GO
DROP DATABASE shnugo;

Short explanation

This is how it works:

  • The cte will use the same function to return a soundex-chain of alle the input's fragments
  • The query will then INNER JOIN this with a LIKE test --this will be sloooooow...
  • The final check is, if the number of hits is the same as number of fragments.

And a final hint: If you want to search for an exact match, but you want to include different writings you can just directly compare the two strings. You might even place an index on the new column SortedSoundExPattern. Due to the way of creation all kinds of "Steven A Smith", "Steeven a Smit" and even in differing order like "Smith Steven A" will produce exactly the same pattern.

Upvotes: 3

StepUp
StepUp

Reputation: 38094

In my view, you should try to use dynamic SQL.

For example, you have a table:

create table tbl_pat_soundex
(
    id int,
    col_str varchar(max)
)

And you have an the following clustered index or any other index(table with over 100 million rows should have some index):

CREATE NONCLUSTERED INDEX myIndex ON dbo.tbl_pat_soundex(id) INCLUDE (col_str)*/

So try to create the following dynamic SQL query based on your logic and execute it. The wish result should look like this:

DECLARE @statement NVARCHAR(4000)
SET @statement = N'
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Smith%' AND id > 0
UNION ALL
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Steve%' AND id > 0
UNION ALL
SELECT col_str
FROM tbl_pat_soundex
WHERE 
    PATINDEX('%Smith%',col_str) >= 1 AND PATINDEX('%A%',col_str) >= 1 AND 
    PATINDEX('%Steve%',col_str) >= 1
    AND id > 0'

Basically, what we do is creating single search queries which will have index seek and then combine all results.

This query will have index seek as we use predicate id > 0(assuming that all ids are greater than 0 or you can write your own negative number):

SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Smith%' AND id > 0

Upvotes: 0

Related Questions