user2075215
user2075215

Reputation: 379

SQL where two fields are similar using "GROUP BY"

I have a database of names, some of the names haven't been insert in the correct fashion. SecondName has sometimes been entered as FirstName.

+-----------------+--------------+
|    FirstName    |  SecondName  |
+-----------------+--------------+
| Bob             | Smith        |
| Gary            | Rose         |
| John            | Jones        |
| Smith           | Bob          |
| Gary            | Oberstein    |
| Adam            | Sorbet       |
| Jones           | John         |
+-----------------+--------------+

I've tried different grouping queries

select `FirstName`
     , `SecondName`
  from `names`
 where ( `FirstName` 
       , `SecondName` ) 
    in ( select `FirstName`
              , `SecondName`
           from `names`
         group
             by `FirstName`
              , `SecondName`
         having count(*) > 1 
       )

But I can't get anything to produce

+-----------------+--------------+---------+
|    FirstName    |  SecondName  |  Count  |
+-----------------+--------------+---------+
| Bob             | Smith        | 2       |
| John            | Jones        | 2       |
+-----------------+--------------+---------+

Upvotes: 1

Views: 47

Answers (2)

Aaron Dietz
Aaron Dietz

Reputation: 10277

If the COUNT() itself isn't important, you can easily separate duplicates with an INNER JOIN

SELECT n.FirstName, n.SecondName, n2.FirstName, n2.SecondName
FROM Names n
INNER JOIN Names n2 on n.FirstName = n2.SecondName and n.SecondName = n2.FirstName

Upvotes: 2

Hogan
Hogan

Reputation: 70523

There is a trick to do this, you need to normalize your names, a quick way to do this is if you alphabetize first name and last name then group on the result.

SELECT name_normalized, count(*) as c
FROM (
  SELECT CASE WHEN FIRSTNAME < LASTNAME THEN FIRSTNAME||LASTNAME
              ELSE LASTNAME|| FIRSTNAME END as name_normalized
  FROM names
) X
GROUP BY name_normalized

Notes:

  • This is the simple case, you could add the normalized result as a column if you want to see the original values.
  • You may need other normalization -- it depends on what your rules are. For example UPPER() to ignore case and TRIM() to remove whitespace.
  • You can add or ignore other columns as is required for matching normalization -- Birthday, Middle Initial etc.
  • Oten time a hash on the normalized string is faster to work with than the string -- your data model might require one or the other.

Upvotes: 4

Related Questions