Reputation: 7260
Want to search the string using PATINDEX and SOUNDEX within the WHERE
clause or any optimal way.
I have the following table with some sample data to search the given string using PATINDEX
and SOUNDEX
.
create table tbl_pat_soundex
(
col_str varchar(max)
);
insert into tbl_pat_soundex values('Smith A Steve');
insert into tbl_pat_soundex values('Steve A Smyth');
insert into tbl_pat_soundex values('A Smeeth Stive');
insert into tbl_pat_soundex values('Steve Smith A');
insert into tbl_pat_soundex values('Smit Steve A');
Note: I have 100 Millions
of records in the table to search for.
String to search:- 'Smith A Steve'
SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND PATINDEX('%A%',col_str) >= 1 AND PATINDEX('%Steve%',col_str) >= 1
Getting Output:
col_str
--------------
Smith A Steve
Steve Smith A
Expected Output:
col_str
----------------
Smith A Steve
Steve A Smyth
A Smeeth Stive
Steve Smith A
Smit Steve A
1:
SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND
PATINDEX('%A%',col_str) >= 1 AND
PATINDEX('%Steve%',col_str) >= 1
2:
SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%'+SOUNDEX('Smith')+'%',SOUNDEX(col_str)) >= 1 AND
PATINDEX('%'+SOUNDEX('A')+'%',SOUNDEX(col_str)) >= 1 AND
PATINDEX('%'+SOUNDEX('Steve')+'%',SOUNDEX(col_str)) >= 1
3:
SELECT col_str
FROM tbl_pat_soundex
WHERE DIFFERENCE('Smith',col_str) = 4 AND
DIFFERENCE('A',col_str) =4 AND
DIFFERENCE('Steve',col_str) = 4
4:
--Following was taking huge time(was kept running more than 20 minutes) to execute.
SELECT DISTINCT col_str
FROM tbl_pat_soundex [a]
CROSS APPLY SplitString([a].[col_str], ' ') [b]
WHERE DIFFERENCE([b].Item,'Smith') >= 1 AND
DIFFERENCE([b].Item,'A') >= 1 AND
DIFFERENCE([b].Item,'Steve') >= 1
Upvotes: 2
Views: 924
Reputation: 67311
With such a lot of rows the only hint I can give you is: Change the design. Each name part should live in a separate column...
The following will work, but I promise it will be slow...
--set up a test db
USE master;
GO
CREATE DATABASE shnugo;
GO
USE shnugo;
GO
--your table, I added an ID-column
create table tbl_pat_soundex
(
ID INT IDENTITY --needed to distinguish rows
,col_str varchar(max)
);
GO
--A function, which will return a blank-separated string as a alphabetically sorted list of distinct soundex values separated by /
: "Smith A Steve" comes back as /A000/S310/S530/
CREATE FUNCTION dbo.ComputeSoundex(@str VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @tmpXML XML=CAST('<x>' + REPLACE((SELECT @str AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML);
RETURN (SELECT DISTINCT '/' + SOUNDEX(x.value('text()[1]','varchar(max)')) AS [se]
FROM @tmpXML.nodes('/x[text()]') A(x)
ORDER BY se
FOR XML PATH(''),TYPE).value('.','nvarchar(max)') + '/';
END
GO
--Add a column to store a computed soundex-chain permanently
ALTER TABLE tbl_pat_soundex ADD SortedSoundExPattern VARCHAR(MAX);
GO
--We need a trigger to maintain the computed soundex-chain on any insert or update
CREATE TRIGGER RefreshComputeSoundex ON tbl_pat_soundex
FOR INSERT,UPDATE
AS
BEGIN
UPDATE s SET SortedSoundExPattern=dbo.ComputeSoundex(i.col_str)
FROM tbl_pat_soundex s
INNER JOIN inserted i ON s.ID=i.ID;
END
GO
--test data
insert into tbl_pat_soundex(col_str) values
('Smith A Steve')
,('Steve A Smyth')
,('A Smeeth Stive')
,('Steve Smith A')
,('Smit Steve A')
,('Smit Steve') --no A
,('Smit A') --no Steve
,('Smit Smith Robert Peter A') --add noise
,('Shnugo'); --something else entirely
--check the intermediate result
SELECT *
FROM tbl_pat_soundex
/*
+----+---------------------------+-----------------------+
| ID | col_str | SortedSoundExPattern |
+----+---------------------------+-----------------------+
| 1 | Smith A Steve | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 2 | Steve A Smyth | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 3 | A Smeeth Stive | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 4 | Steve Smith A | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 5 | Smit Steve A | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 6 | Smit Steve | /S310/S530/ |
+----+---------------------------+-----------------------+
| 7 | Smit A | /A000/S530/ |
+----+---------------------------+-----------------------+
| 8 | Smit Smith Robert Peter A | /A000/P360/R163/S530/ |
+----+---------------------------+-----------------------+
| 9 | Shnugo | /S520/ |
+----+---------------------------+-----------------------+
*/
--Now we can start to search:
DECLARE @StringToSearch VARCHAR(MAX)=' A Steve';
WITH SplittedSearchString AS
(
SELECT soundexCode.value('text()[1]','nvarchar(max)') AS SoundExCode
FROM (SELECT CAST('<x>' + REPLACE(dbo.ComputeSoundex(@StringToSearch),'/','</x><x>') + '</x>' AS XML)) A(x)
CROSS APPLY x.nodes('/x[text()]') B(soundexCode)
)
SELECT a.ID,col_str
FROM tbl_pat_soundex a
INNER JOIN SplittedSearchString s On SortedSoundExPattern LIKE '%/' + s.SoundExCode + '/%'
GROUP BY ID,col_str
HAVING COUNT(ID)=(SELECT COUNT(*) FROM SplittedSearchString)
ORDER BY ID
GO
--clean-up
USE master;
GO
DROP DATABASE shnugo;
This is how it works:
INNER JOIN
this with a LIKE
test --this will be sloooooow...And a final hint: If you want to search for an exact match, but you want to include different writings you can just directly compare the two strings. You might even place an index on the new column SortedSoundExPattern
. Due to the way of creation all kinds of "Steven A Smith", "Steeven a Smit" and even in differing order like "Smith Steven A" will produce exactly the same pattern.
Upvotes: 3
Reputation: 38094
In my view, you should try to use dynamic SQL.
For example, you have a table:
create table tbl_pat_soundex
(
id int,
col_str varchar(max)
)
And you have an the following clustered index or any other index(table with over 100 million rows should have some index):
CREATE NONCLUSTERED INDEX myIndex ON dbo.tbl_pat_soundex(id) INCLUDE (col_str)*/
So try to create the following dynamic SQL query based on your logic and execute it. The wish result should look like this:
DECLARE @statement NVARCHAR(4000)
SET @statement = N'
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Smith%' AND id > 0
UNION ALL
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Steve%' AND id > 0
UNION ALL
SELECT col_str
FROM tbl_pat_soundex
WHERE
PATINDEX('%Smith%',col_str) >= 1 AND PATINDEX('%A%',col_str) >= 1 AND
PATINDEX('%Steve%',col_str) >= 1
AND id > 0'
Basically, what we do is creating single search queries which will have index seek and then combine all results.
This query will have index seek as we use predicate id > 0
(assuming that all ids are greater than 0
or you can write your own negative number):
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Smith%' AND id > 0
Upvotes: 0