Shaun
Shaun

Reputation: 2052

Querying for swapped columns in SQL database

We've had a few cases of people entering in first names where last names should be and vice versa. So I'm trying to come up with a SQL search to match the swapped columns. For example, someone may have entered the record as first_name = Smith, last_name = John by accident. Later, another person may see that John Smith is not in the database and enter a new user as first_name = John, last_name = Smith, when in fact it is the same person.

I used this query to help narrow my search:

SELECT person_id, first_name, last_name
FROM people
WHERE first_name IN (
    SELECT last_name FROM people
) AND last_name IN (
    SELECT first_name FROM people
);

But if we have people named John Allen, Allen Smith, and Smith John, they would all be returned even though none of those are actually duplicates. In this case, it's actually good enough that I can see the duplicates in my particular data set, but I'm wondering if there's a more precise way to do this.

Upvotes: 0

Views: 65

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270653

Interesting. This is a problem that I cover in Data Analysis Using SQL and Excel (note: I only very rarely mention books in my answers or comments).

The idea is to summarize the data to get a likelihood of a mismatch. So, look at the number of times a name appears as a first name and as a last name and then combine these. So:

with names as (
      select first_name as name, 1.0 as isf, 0.0 as isl
      from people
      union all
      select last_name, 0, 1
      from people
     ),
     nl as (
      select name, sum(isf) as numf, sum(isl) as numl,
             avg(isf) as p_f, avg(isl) as p_l
      from names
      group by name
     )
select p.*
from people p join
     nl nlf
     on p.first_name = nlf.name join
     nl nll
     on p.last_name = nll.name
order by (coalesce(nlf.p_l, 0) + coalesce(nll.p_f, 0));

This orders the records by a measure of mismatch of the names -- the sum of the probabilities of the first name used by a last name and a last name used as a first name.

Upvotes: 0

Hogan
Hogan

Reputation: 70538

I would do a self join like this:

SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on p1.first_name = p2.last_name and p1.last_name = p2.first_name

To also find typos on names I recommend this:

SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on soundex(p1.first_name) = soundex(p2.last_name) and
                  soundex(p1.last_name) = soundex(p2.first_name)

soundex is a neat function that "hashes" words in a way that two words that sound the same get the same hash. This means Anne and Ann will have the same soundex. So if you had an Anne Smith and a Smith Ann the query above would find them as a match.

Upvotes: 1

Related Questions