Reputation: 12785
I have 2 tables with many columns and I want to find those rows where the value from table1.somecolumn is contained in table2.someothercolumn. Example:
table1.somecolumn has Smith, Peter and
table2.someothercolumn has peter.smith
That should be a match, how would I do such a search?
Upvotes: 2
Views: 2494
Reputation: 9638
If it follows a strict pattern so that there is always a comma and spaces between the names, you can also split the string by comma and space, see STRING_SPLIT (Transact-SQL).
string_split()
on 'Smith, Peter' and an ID
column would become:
ID String
1 Smith
1
1 Peter
Then check with a select somestring like '%'+someotherstring+'%'
(and lower
all strings) whether all of the String
:s of one ID
can be found somewhere in another table's column.
I do not take the time to check this further, but you can get there step by step. I do not think that this is a helpful answer since you should rather check the fuzzy names with tools like the fuzzy logic in SSIS or other (external) components for SSIS. Also search for stored procedures. And check whether you can get this done with code of another programming language.
Upvotes: 0
Reputation: 3034
You can try either SOUNDEX
or DIFFERENCE
functions to help match string literals.
Example:
select difference('peter.green', 'Green, Peter')
returns 2
, whereby:
The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.
See SOUNDEX and DIFFERENCE topics on MSDN.
Update:
Soundex & Difference may not function well when the order of words are considered, but if you have full text indexing capabilities installed, you don't need to create an index to use the word breaking and parsing capabilities of the full text engine. Assuming you're using SQL Server 2008, the following function will return you a list of normalised terms:
SELECT * FROM sys.dm_fts_parser('"Peter Green"', 1033, 0, 0)
Through which you can CROSS APPLY
to the remainder of your query.
See the sys.dm_fts_parser topic & Section K. Using Apply within the FROM topic for more info.
Example: (SQL Server Enterprise 2008 with Full Text Engine enabled)
if not OBJECT_ID('Names1', 'Table') is null drop table names1
if not OBJECT_ID('Names2', 'Table') is null drop table names2
create table Names1
(
id int identity(0, 1),
name nvarchar(128)
)
insert into Names1 (name) values ('Green, Peter')
insert into Names1 (name) values ('Smith, Peter')
insert into Names1 (name) values ('Aadland, Beverly')
insert into Names1 (name) values ('Aalda, Mariann')
insert into Names1 (name) values ('Aaliyah')
insert into Names1 (name) values ('Aames, Angela')
insert into Names1 (name) values ('Aames, Willie')
insert into Names1 (name) values ('Aaron, Caroline')
insert into Names1 (name) values ('Aaron, Quinton')
insert into Names1 (name) values ('Aaron, Victor')
insert into Names1 (name) values ('Abbay, Peter')
insert into Names1 (name) values ('Abbott, Dorothy')
insert into Names1 (name) values ('Abbott, Bruce')
insert into Names1 (name) values ('Abbott, Bud')
insert into Names1 (name) values ('Abbott, Philip')
insert into Names1 (name) values ('Abdoo, Rose')
insert into Names1 (name) values ('Abdul, Paula')
insert into Names1 (name) values ('Abel, Jake')
insert into Names1 (name) values ('Abel, Walter')
insert into Names1 (name) values ('Abeles, Edward')
insert into Names1 (name) values ('Abell, Tim')
insert into Names1 (name) values ('Aber, Chuck')
create table Names2
(
id int identity(200, 1),
name nvarchar(128)
)
insert into Names2 (name) values (LOWER('Peter.Green'))
insert into Names2 (name) values (LOWER('Peter.Smith'))
insert into names2 (name) values (LOWER('Beverly.Aadland'))
insert into names2 (name) values (LOWER('Mariann.Aalda'))
insert into names2 (name) values (LOWER('Aaliyah'))
insert into names2 (name) values (LOWER('Angela.Aames'))
insert into names2 (name) values (LOWER('Willie.Aames'))
insert into names2 (name) values (LOWER('Caroline.Aaron'))
insert into names2 (name) values (LOWER('Quinton.Aaron'))
insert into names2 (name) values (LOWER('Victor.Aaron'))
insert into names2 (name) values (LOWER('Peter.Abbay'))
insert into names2 (name) values (LOWER('Dorothy.Abbott'))
insert into names2 (name) values (LOWER('Bruce.Abbott'))
insert into names2 (name) values (LOWER('Bud.Abbott'))
insert into names2 (name) values (LOWER('Philip.Abbott'))
insert into names2 (name) values (LOWER('Rose.Abdoo'))
insert into names2 (name) values (LOWER('Paula.Abdul'))
insert into names2 (name) values (LOWER('Jake.Abel'))
insert into names2 (name) values (LOWER('Walter.Abel'))
insert into names2 (name) values (LOWER('Edward.Abeles'))
insert into names2 (name) values (LOWER('Tim.Abell'))
insert into names2 (name) values (LOWER('Chuck.Aber'));
with ftsNamesFirst (id, term) as
(
select id, terms.display_term
from names1 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
), ftsNamesSecond (id, term) as
(
select id, terms.display_term
from names2 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
)
select * from
(
select
ROW_NUMBER() over (partition by nfirst.id order by sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) desc) ranking,
sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) Confidence,
nFirst.id Names1ID,
nFirst.name Names1Name,
nSecond.id Names2ID,
nSecond.name Names2Name
from
ftsNamesFirst cross join ftsNamesSecond
left outer join names1 nFirst on nFirst.id = ftsNamesFirst.id
left outer join names2 nSecond on nSecond.id = ftsNamesSecond.id
where DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term) = 4
group by
nFirst.id, nFirst.name, nSecond.id, nSecond.name
) MatchedNames
where ranking = 1
Outputs:
Where the matches with highest confidence take precedence (all others are filtered out using a windowed ranking query).
Confidence Names1ID Names1Name Names2ID Names2Name
8 0 Green, Peter 200 peter.green
8 1 Smith, Peter 201 peter.smith
8 2 Aadland, Beverly 202 beverly.aadland
8 3 Aalda, Mariann 203 mariann.aalda
4 4 Aaliyah 204 aaliyah
8 5 Aames, Angela 205 angela.aames
8 6 Aames, Willie 206 willie.aames
It's not perfect, but this is a nice starting point from where it can be tweaked to give a higher probability of success.
Upvotes: 2
Reputation: 842
There are several possible solutions depending on exactly what you need: Use could create auxiliary table to store keyword for every record
Upvotes: 1