grady
grady

Reputation: 12785

How do I check if one column value is contained in another column value (TSQL)?

I have 2 tables with many columns and I want to find those rows where the value from table1.somecolumn is contained in table2.someothercolumn. Example:

table1.somecolumn has Smith, Peter and
table2.someothercolumn has peter.smith

That should be a match, how would I do such a search?

Upvotes: 2

Views: 2494

Answers (3)

questionto42
questionto42

Reputation: 9638

If it follows a strict pattern so that there is always a comma and spaces between the names, you can also split the string by comma and space, see STRING_SPLIT (Transact-SQL).

string_split() on 'Smith, Peter' and an ID column would become:

ID String
1  Smith
1  
1  Peter

Then check with a select somestring like '%'+someotherstring+'%' (and lower all strings) whether all of the String:s of one ID can be found somewhere in another table's column.

I do not take the time to check this further, but you can get there step by step. I do not think that this is a helpful answer since you should rather check the fuzzy names with tools like the fuzzy logic in SSIS or other (external) components for SSIS. Also search for stored procedures. And check whether you can get this done with code of another programming language.

Upvotes: 0

Rabid
Rabid

Reputation: 3034

You can try either SOUNDEX or DIFFERENCE functions to help match string literals.

Example:

select difference('peter.green', 'Green, Peter')

returns 2, whereby:

The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.

See SOUNDEX and DIFFERENCE topics on MSDN.

Update:

Soundex & Difference may not function well when the order of words are considered, but if you have full text indexing capabilities installed, you don't need to create an index to use the word breaking and parsing capabilities of the full text engine. Assuming you're using SQL Server 2008, the following function will return you a list of normalised terms:

SELECT * FROM sys.dm_fts_parser('"Peter Green"', 1033, 0, 0)

Through which you can CROSS APPLY to the remainder of your query.

See the sys.dm_fts_parser topic & Section K. Using Apply within the FROM topic for more info.

Example: (SQL Server Enterprise 2008 with Full Text Engine enabled)

if not OBJECT_ID('Names1', 'Table') is null drop table names1
if not OBJECT_ID('Names2', 'Table') is null drop table names2

create table Names1 
(
    id int identity(0, 1),
    name nvarchar(128)
)
insert into Names1 (name) values ('Green, Peter')
insert into Names1 (name) values ('Smith, Peter')
insert into Names1 (name) values ('Aadland, Beverly')
insert into Names1 (name) values ('Aalda, Mariann')
insert into Names1 (name) values ('Aaliyah')
insert into Names1 (name) values ('Aames, Angela')
insert into Names1 (name) values ('Aames, Willie')
insert into Names1 (name) values ('Aaron, Caroline')
insert into Names1 (name) values ('Aaron, Quinton')
insert into Names1 (name) values ('Aaron, Victor')
insert into Names1 (name) values ('Abbay, Peter')
insert into Names1 (name) values ('Abbott, Dorothy')
insert into Names1 (name) values ('Abbott, Bruce')
insert into Names1 (name) values ('Abbott, Bud')
insert into Names1 (name) values ('Abbott, Philip')
insert into Names1 (name) values ('Abdoo, Rose')
insert into Names1 (name) values ('Abdul, Paula')
insert into Names1 (name) values ('Abel, Jake')
insert into Names1 (name) values ('Abel, Walter')
insert into Names1 (name) values ('Abeles, Edward')
insert into Names1 (name) values ('Abell, Tim')
insert into Names1 (name) values ('Aber, Chuck')

create table Names2
(
    id int identity(200, 1),
    name nvarchar(128)
)
insert into Names2 (name) values (LOWER('Peter.Green'))
insert into Names2 (name) values (LOWER('Peter.Smith'))
insert into names2 (name) values (LOWER('Beverly.Aadland'))
insert into names2 (name) values (LOWER('Mariann.Aalda'))
insert into names2 (name) values (LOWER('Aaliyah'))
insert into names2 (name) values (LOWER('Angela.Aames'))
insert into names2 (name) values (LOWER('Willie.Aames'))
insert into names2 (name) values (LOWER('Caroline.Aaron'))
insert into names2 (name) values (LOWER('Quinton.Aaron'))
insert into names2 (name) values (LOWER('Victor.Aaron'))
insert into names2 (name) values (LOWER('Peter.Abbay'))
insert into names2 (name) values (LOWER('Dorothy.Abbott'))
insert into names2 (name) values (LOWER('Bruce.Abbott'))
insert into names2 (name) values (LOWER('Bud.Abbott'))
insert into names2 (name) values (LOWER('Philip.Abbott'))
insert into names2 (name) values (LOWER('Rose.Abdoo'))
insert into names2 (name) values (LOWER('Paula.Abdul'))
insert into names2 (name) values (LOWER('Jake.Abel'))
insert into names2 (name) values (LOWER('Walter.Abel'))
insert into names2 (name) values (LOWER('Edward.Abeles'))
insert into names2 (name) values (LOWER('Tim.Abell'))
insert into names2 (name) values (LOWER('Chuck.Aber'));

with ftsNamesFirst (id, term) as
(
    select id, terms.display_term
        from names1 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
), ftsNamesSecond (id, term) as
(
select id, terms.display_term
        from names2 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
)
select * from 
(
    select 
    ROW_NUMBER() over (partition by nfirst.id order by sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) desc) ranking,
    sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) Confidence,
    nFirst.id Names1ID,
    nFirst.name Names1Name, 
    nSecond.id Names2ID,
    nSecond.name Names2Name
    from 
    ftsNamesFirst cross join ftsNamesSecond 
    left outer join names1 nFirst on nFirst.id = ftsNamesFirst.id
    left outer join names2 nSecond on nSecond.id = ftsNamesSecond.id 
    where DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term) = 4
    group by 
        nFirst.id, nFirst.name, nSecond.id, nSecond.name
) MatchedNames 
where ranking = 1

Outputs:

Where the matches with highest confidence take precedence (all others are filtered out using a windowed ranking query).

Confidence Names1ID Names1Name Names2ID Names2Name
8   0   Green, Peter    200 peter.green
8   1   Smith, Peter    201 peter.smith
8   2   Aadland, Beverly    202 beverly.aadland
8   3   Aalda, Mariann  203 mariann.aalda
4   4   Aaliyah 204 aaliyah
8   5   Aames, Angela   205 angela.aames
8   6   Aames, Willie   206 willie.aames

It's not perfect, but this is a nice starting point from where it can be tweaked to give a higher probability of success.

Upvotes: 2

burnall
burnall

Reputation: 842

There are several possible solutions depending on exactly what you need: Use could create auxiliary table to store keyword for every record

  1. Use auxiliary table to store keywords for every record or record and field. E.g. table_helper(id int primary key, record_id int, keyword varchar), record_id - link to source table. Fill this table in trigger for table1, table2. Query to find common rows is a simple intersection of table_helper with itself. You could create one helper for table1 and table2 or use separate tables.
  2. Use full-text indexes.

Upvotes: 1

Related Questions