Search: Match all words from multiple columns

Question

I'm trying to add lots (1700) of names to an existing database, but need to check for duplicates. In fact, we assume most are duplicates. Unfortunately, the names are from address labels, and are not separated by fields (some are organization names, some are people names). To lighten the load on a human, I'd like to search for good matches on the name first. By good match, I mean I want all the words in the name (John Julie Smith) to be matched across several db fields (title, firstname, lastname, suffix, spousename). So if firstname is John, lastname is Smith and spousename is Julie, this would match, or if firstname (in the db) is "John Julie" and lastname is "Smith", that would also match.

I'm halfway through a script that will do this all in PHP and run a separate query for every single possibility. like lastname = 'john julie smith', firstname = 'john julie smith'... lastname = 'john julie' AND firstname = 'smith' etc etc etc! That's 105 queries for a three-word name, and I have 1700 names to process. That sounds rediculous to me.

PHP I know fairly well, but I'm not great with MySQL. Is there a query that can try to match all the words across multiple columns? Even if it only handles one of the name combinations ("John, Julie, Smith" or "John Julie, Smith"). Maybe even using Regex?

Here's where I am on this.

foreach( $a as $name ) {
    //There's some more stuff up here to prepare the strings,
    //removing &/and, punctuation, making everything lower case...

    $na = explode( " ", $name );

    $divisions = count( $na ) - 1;
    $poss = array();
    for( $i = 0; $i < pow(2, $divisions); $i++ ) {
        $div = str_pad(decbin($i), $divisions, '0', STR_PAD_LEFT);
        $tpa = array();
        $tps = '';
        foreach($na as $nak => $nav) {
            if ( $nak > 0 && substr( $div, $nak - 1, 1 ) ) {
                $tpa[] = $tps;
                $tps = $nav;
            } else {
                $tps = trim( $tps . ' ' . $nav );
            }
        }
        $tpa[] = $tps;
        $poss[] = $tpa;
    }
    foreach( $poss as $possk => $possv ) {
        $count = count( $possv );
        //Here's where I am... 
        //I could use $count and some math to come up with all the possible searches here,
        //But my head is starting to spin as I try to think of how to do that.
    }

    die();
}

So far the PHP creates an array ($poss) with all the possible arrangements of the words in the name string. For "John Julie Smith", the array looks like this:

Array
(
    [0] => Array
        (
            [0] => john julie smith
        )

    [1] => Array
        (
            [0] => john julie
            [1] => smith
        )

    [2] => Array
        (
            [0] => john
            [1] => julie smith
        )

    [3] => Array
        (
            [0] => john
            [1] => julie
            [2] => smith
        )

)

The original idea was to iterate through the array and create a bazillion queries. For [0], there would be 5 queries:

... WHERE firstname = 'john julie smith';
... WHERE lastname = 'john julie smith';
... WHERE spousename = 'john julie smith';
... WHERE title = 'john julie smith';
... WHERE suffix = 'john julie smith';

but for [1] there would be 20 queries:

... WHERE firstname = 'john julie' AND lastname = 'smith';
... WHERE firstname = 'john julie' AND spousename = 'smith';
... WHERE firstname = 'john julie' AND title = 'smith';
... WHERE firstname = 'john julie' AND lastname = 'smith';
... WHERE firstname = 'john julie' AND suffix = 'smith';
... WHERE lastname = 'john julie' AND firstname = 'smith';
... WHERE lastname = 'john julie' AND spousename = 'smith';
... WHERE lastname = 'john julie' AND title = 'smith';
... WHERE lastname = 'john julie' AND lastname = 'smith';
... WHERE lastname = 'john julie' AND suffix = 'smith';
//and on and on

For [3] there would be 60 queries! I'm looking at 170,000+ queries at this rate!

There must be a better way...

Gordon Linoff · Accepted Answer

Load the 1700 names into a table in MySQL.

Then, I think the following approach will help. Look for matches in the fields and order the rows by the ones with the most matches. This is not 100% perfect, I suspect it will be a bit help. The query is:

select n.name, t.*,
       (n.name like concat('%', firstname, '%') +
        n.name like concat('%', lastname, '%') +
        n.name like concat('%', suffix, '%') +
        n.name like concat('%', spousename, '%')
       ) as NumMatches
from table t join
     names n
     on n.name like concat('%', firstname, '%') or
        n.name like concat('%', lastname, '%') or
        n.name like concat('%', suffix, '%') or
        n.name like concat('%', spousename, '%')
group by t.firstname, t.lastname, t.suffix, t.spousename, n.name
order by NumMatches;

EDIT:

I left this out the first time, but you can count the number of words in each name and the number of matches. Put this clause before the order by:

having NumMatches = length(n.name) - length(replace(n.n, ' ', '')

This still isn't perfect, because the same name could be in multiple fields. In practice, it should work pretty well. If you want to be more pedantic, you can do something like:

having concat_ws(':', firstname, lastname, suffice, spousename) like concat('%', substring_index(n.name, ' ', 1), '%') and
       concat_ws(':', firstname, lastname, suffice, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 2), ' ', -1), '%') and
       concat_ws(':', firstname, lastname, suffice, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 3), ' ', -1), '%') and
       concat_ws(':', firstname, lastname, suffice, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 4), ' ', -1), '%')

This will test each name independently.

Search: Match all words from multiple columns

Answers (1)

Related Questions