Chris Baker

Reputation: 50592

Finding and dealing with duplicate users

In a large user database with the following format and sample data, we are trying to identify duplicated people:

id   first_name    last_name   email
---------------------------------------------------
 1   chris         baker       
 2   chris         baker       [email protected]
 3   chris         baker       [email protected]
 4   chris         baker       [email protected]  
 5   carl          castle      [email protected]
 6   mike          rotch       [email protected]  

I am using the following query:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1

This works great; I get a list of duplicates with the id numbers of the involved rows.

We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
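As a rough sketch of that cleanup step: the question does not name the tables holding associated data, so the orders table below is purely hypothetical. Wrapping the reassignment and the delete in one transaction (on a transactional engine such as InnoDB) means a failure part-way through does not leave orphaned rows.

```sql
-- Assumption: `orders` is a hypothetical table with a user_id column
-- pointing at users.id; substitute the real associated tables.
START TRANSACTION;

-- Move the duplicate's data to the surviving user (3 -> 2).
UPDATE orders SET user_id = 2 WHERE user_id = 3;

-- Delete the duplicate row now that nothing references it.
DELETE FROM users WHERE id = 3;

COMMIT;
```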

The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.

We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.

What I tried

I added an is_not_duplicate field to the user table, but if a new duplicate "Chris Baker" is then added to the database, it will not show up on the duplicate report, because is_not_duplicate improperly excludes one of the accounts. My HAVING clause would not meet the > 1 threshold until there are two duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.

Question Summed Up

How can I build exceptions into the above query without looping results or multiple queries?

Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.

Upvotes: 16

Views: 2151

Answers (16)

joocer

Reputation: 1121

I see someone else has been voted down for the suggestion of merging, but nothing about your problem statement says the data needs to stay in place. The OP followed up with their solution, which happens to be a pure SQL one, but that doesn't imply that every solution needs to be limited to that.

The issue as I understand is around contacts having multiple, similar, but not necessarily identical records in your database, which has cost and reputational implications so you're looking to deduplicate these records.

I would write a batch job that searches for potential duplicates (this can be as complicated or as simple as you like) and then close the two records that it finds are dupes and create a new record.

To enable that you'd need four new columns:

  • Status, which would be either Open, Merged, Split
  • RelatedId, which would hold the value of who the record was merged with
  • ChainId, the new record Id
  • DateStatusChanged, obvious enough

Open would be the default status. Merged would be set when the record is merged (effectively closed and replaced). Split would be set if the merge was reversed.
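A minimal sketch of those four columns in MySQL, assuming they are added to the users table from the question. The snake_case column names and the types are illustrative choices, not part of the suggestion itself.

```sql
-- Assumption: column names and types are illustrative only.
ALTER TABLE users
    ADD COLUMN status VARCHAR(10) NOT NULL DEFAULT 'Open',  -- Open, Merged or Split
    ADD COLUMN related_id INT NULL,           -- id of the record this one was merged with
    ADD COLUMN chain_id INT NULL,             -- id of the new record created by the merge
    ADD COLUMN date_status_changed DATE NULL; -- when the status last changed
```

Existing rows pick up the 'Open' default, so the batch job only has to touch rows it actually merges or splits.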

So, as an example, go through all of the records that, for example, have the same name. Merge them in pairs. So if you have three Chris Bakers, records 1, 2 and 3, merge 1 and 2 to make record 4 and then 3 and 4 to make record 5. Your table would end up something like:

ID  NAME        STATUS  RELATEDID  CHAINID DATESTATUSCHANGED [other rows omitted]
 1  Chris Baker MERGED          2        4       27-AUG-2012
 2  Chris Baker MERGED          1        4       27-AUG-2012
 3  Chris Baker MERGED          4        5       28-AUG-2012
 4  Chris Baker MERGED          3        5       28-AUG-2012
 5  Chris Baker   OPEN

This way you have a full record of what has happened to your data and can reverse any changes by unmerging. If, for example, contacts 1 and 2 weren't the same, you would reverse the merge of 3 and 4, then reverse the merge of 1 and 2, and you'd end up with this:

ID  NAME        STATUS  RELATEDID  CHAINID DATESTATUSCHANGED
 1  Chris Baker  SPLIT          2        4       29-AUG-2012
 2  Chris Baker  SPLIT          1        4       29-AUG-2012
 3  Chris Baker  SPLIT          4        5       29-AUG-2012
 4  Chris Baker CLOSED          3        5       29-AUG-2012
 5  Chris Baker CLOSED                           29-AUG-2012

You could then manually merge, as you'd probably not want your job to automatically remerge split records.

Upvotes: 0

Seb Rose

Reputation: 3666

If you were to correct all duplicates each time you run the report, then a very simple solution might be to modify the query:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    MAX(id) AS "max_id",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1
    AND
    max_id > MAX_ID_LAST_TIME_DUPLICATE_REPORT_WAS_GENERATED;

Upvotes: 1

Cninroh

Reputation: 1796

Why don't you make the email column a unique identifier in this case and, after you cleanse your records once, disallow duplicates from then on?
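One way this could look in MySQL, assuming the users table from the question; the index name uq_users_email is made up for illustration. The statement fails if repeated email values still exist, so it has to run after the one-off cleanup.

```sql
-- Assumption: existing duplicate emails have already been cleaned up.
-- A MySQL UNIQUE index allows any number of NULLs, so users without an
-- email should be stored as NULL rather than as an empty string.
ALTER TABLE users ADD UNIQUE INDEX uq_users_email (email);
```

This only blocks exact repeats of an email address; the same person signing up with a different address would still get through.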

Upvotes: -1

dbenham

Reputation: 130819

I gave Justin Pihony +1 as the 1st to suggest comparing the duplicate count with the not duplicate count, and Hrant Khachatrian +1 for being the 1st to show an efficient way of doing that.

Here is a slightly different method, plus some renaming to make everything a bit more self explanatory, plus some extra columns in the query to make it obvious which records need to be compared as potential duplicates.

I would call the new column "CONFIRMED_UNIQUE" instead of "IS_NOT_DUPLICATE". Like Hrant I would make it Boolean (tinyint(1) with 0=FALSE and 1=TRUE).

The "potential_duplicate_count" is the maximum number of records that would have to be deleted.

select
    group_concat(case when not confirmed_unique then id end) as potential_duplicate_ids,
    group_concat(case when confirmed_unique then id end) as confirmed_unique_ids,
    concat(upper(first_name), upper(last_name)) as name,
    sum( case when not confirmed_unique then 1 end ) - (not max(confirmed_unique)) as potential_duplicate_count
from
    users
group by
    name
having
    potential_duplicate_count > 0

Upvotes: 0

mehdi lotfi

Reputation: 11571

Add an is_not_duplicate column of type bit to your table and, after setting its value for the verified rows, use the query below:

SELECT  GROUP_CONCAT(id) AS "ids",
        CONCAT(UPPER(first_name), UPPER(last_name)) AS "name"
FROM users 
GROUP BY name 
HAVING COUNT(*) > SUM(CAST(is_not_duplicate AS UNSIGNED))

The query above compares the total number of same-name rows with the number of rows already confirmed as not duplicates.

Upvotes: -1

KoU_warch

Reputation: 2150

I would suggest creating a couple of things:

  1. A Boolean column to flag confirmed users
  2. A String column to save ids
  3. A trigger that will check if the first name and last name are already there to fill up the flag, and save in the string column all ids to which this one is a possible duplicate.

Then build a report that looks for rows where the flag is true and decodes the string field to match up the possible duplicates.

Upvotes: 0

BendaThierry.com

Reputation: 2110

If I were you, I would add some geolocation tables/fields to my database schema.

The probability that two end users have the same name AND live in the same place is very low, except in very big towns, and even then you can split the geolocation into smaller areas. It's a matter of granularity.

Good luck.

Upvotes: 0

Pablo Jomer

Reputation: 10378

I think it would make sense to create a lookup table storing the ids of the ones that are not duplicates. That way confirmed non-duplicates are removed, and the query only has to add a small lookup of the duplicates actually found against the lookup table.

For instance, in this example we would have

id 1 | id 2

 2      4

if [email protected] and [email protected] are different people.

Upvotes: 0

Lawrence Barsanti

Reputation: 33232

Is there a good reason for not merging duplicate accounts into a single account?

From the comments, it seems like the information is being used mostly for contact information, so merging should be relatively painless and low risk. Once you merge users they will no longer appear in your duplicate report. Furthermore, your users table will actually shrink, which could help with performance.

Upvotes: -1

kaefert

Reputation: 309

Well, it seems to me that the is_not_duplicate column is not complex enough to hold the information you want to store. From what I understand, you want to manually tell your detection that two distinct users are not duplicates of each other. So either you create a column like is_not_duplicate_of=other-user-id or, if you want to keep open the possibility that one user can be manually defined as not a duplicate of more than one user, you need a separate table with two user-id columns.

The query telling you the non-overridden duplicates probably has to be a bit more complex than the one you suggested; I cannot think of one that works with GROUP BY and HAVING logic. The only thing that comes to mind is something like

SELECT u1.* FROM users u1
INNER JOIN users u2
ON u1.id <> u2.id
AND UPPER(u2.first_name) = UPPER(u1.first_name)
AND UPPER(u2.last_name) = UPPER(u1.last_name)
WHERE NOT EXISTS (
  SELECT *
  FROM users_non_dups un
  WHERE (un.id1 = u1.id AND un.id2 = u2.id)
  OR (un.id1 = u2.id AND un.id2 = u1.id)
)

Upvotes: 1

Sameer

Reputation: 4389

If you are OK with making a slight change to the format of the report, you could do a self-join like this:

SELECT 
    CONCAT(u1.id,",", u2.id) AS "ids",
    CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS "name"
FROM 
    users u1, users u2
WHERE
    u1.id < u2.id AND
    UPPER(u1.first_name) = UPPER(u2.first_name) AND
    UPPER(u1.last_name) = UPPER(u2.last_name) AND
    CONCAT(u1.id,",", u2.id) NOT IN (SELECT ids from not_dupe)

which reports duplicates as follows:

ids | name
----|--------
1,2 | CHRISBAKER
1,3 | CHRISBAKER
...

And the not_dupe table would have rows like below:

ids
------
1,2
3,4
...

Upvotes: 0

Hrant Khachatrian

Reputation: 3109

Try to add the is_not_duplicate boolean field and modify your code as follows:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count",
    SUM(is_not_duplicate) AS "real_count"
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1
AND
    duplicate_count - real_count > 0

Newly added duplicates will have is_not_duplicate=0 so the real_count for that name will be less than duplicate_count and the row will be shown
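To make that concrete, here is a small walk-through meant for a scratch schema. It uses a cut-down copy of the question's table (email omitted) and assumes ids 2 and 3 have already been merged away and ids 1 and 4 have been verified as two different people.

```sql
-- Scratch copy of the table; run in an empty test schema, not production.
CREATE TABLE users (
    id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    is_not_duplicate TINYINT(1) NOT NULL DEFAULT 0
);

INSERT INTO users (id, first_name, last_name, is_not_duplicate) VALUES
    (1, 'chris', 'baker', 1),
    (4, 'chris', 'baker', 1),
    (5, 'carl', 'castle', 0),
    (6, 'mike', 'rotch', 0);
-- At this point the report is empty: CHRISBAKER has duplicate_count = 2
-- and real_count = 2, so duplicate_count - real_count = 0.

-- A new, unverified Chris Baker signs up (flag defaults to 0).
INSERT INTO users (id, first_name, last_name) VALUES (7, 'Chris', 'Baker');

-- The same report as above now returns one row for CHRISBAKER,
-- with duplicate_count = 3 and real_count = 2.
SELECT
    GROUP_CONCAT(id) AS ids,
    CONCAT(UPPER(first_name), UPPER(last_name)) AS name,
    COUNT(*) AS duplicate_count,
    SUM(is_not_duplicate) AS real_count
FROM users
GROUP BY name
HAVING duplicate_count > 1
   AND duplicate_count - real_count > 0;
```

The returned ids include the two verified rows as well as the new one, so whoever reviews the report still has to work out which of them the newcomer might duplicate.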

Upvotes: 7

Emil Vikström

Reputation: 91912

Since this is basically a many-to-many relationship I would add a new table not_duplicate with fields user1 and user2.

I would probably add two rows for each not_duplicate relationship such that I have one row for 2 -> 3 and a symmetric row for 3 -> 2 to ease querying, but that may introduce data inconsistencies so make sure you delete both rows at the same time (or have only one row and make the correct query in your script).
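A possible shape for the single-row variant, sketched under the assumption that the smaller id is always stored in user1 (a convention chosen here for illustration, which removes the need for the symmetric row). It reports pairs rather than groups, and assumes the users table from the question.

```sql
-- Assumption: each cleared pair is stored once, smaller id in user1.
CREATE TABLE not_duplicate (
    user1 INT NOT NULL,
    user2 INT NOT NULL,
    PRIMARY KEY (user1, user2)
);

-- Users 1 and 4 were verified as two different people.
INSERT INTO not_duplicate (user1, user2) VALUES (1, 4);

-- Every same-name pair that has not been cleared yet.
SELECT
    u1.id AS id_a,
    u2.id AS id_b,
    CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS name
FROM users u1
JOIN users u2
  ON u1.id < u2.id
 AND UPPER(u1.first_name) = UPPER(u2.first_name)
 AND UPPER(u1.last_name) = UPPER(u2.last_name)
LEFT JOIN not_duplicate nd
  ON nd.user1 = u1.id
 AND nd.user2 = u2.id
WHERE nd.user1 IS NULL;  -- anti-join: keep pairs with no matching exception
```

Because the join compares UPPER() expressions, it probably cannot use a plain index on the name columns; on a large table a stored, normalised name column with an index may be worth considering.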

Upvotes: 2

georgepsarakis

Reputation: 1957

I am not sure if this will work, but could you consider the reverse logic of adding an is_duplicate_of column? That way you can mark duplicates by entering the ID of the first record in this column, which will be greater than zero. The records that you wish to retain will have a 0 value in this field. You can set the default (unchecked records) to -1 to keep track of the validation status for each record.

Afterwards you can keep executing an SQL that will compare new records only with correct records having is_duplicate_of = 0 .
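A sketch of what that comparison might look like, assuming the users table from the question and the -1 / 0 / id convention described above:

```sql
-- -1 = not yet checked, 0 = confirmed real user, >0 = id of the original.
ALTER TABLE users ADD COLUMN is_duplicate_of INT NOT NULL DEFAULT -1;

-- Unchecked rows that share a name with a confirmed real user.
SELECT
    n.id AS unchecked_id,
    k.id AS confirmed_id
FROM users n
JOIN users k
  ON k.is_duplicate_of = 0
 AND k.id <> n.id
 AND UPPER(k.first_name) = UPPER(n.first_name)
 AND UPPER(k.last_name) = UPPER(n.last_name)
WHERE n.is_duplicate_of = -1;
```

Note that this only compares unchecked rows against confirmed ones, as described; two unchecked rows that duplicate each other would need a separate pass.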

Upvotes: 0

Justin Pihony

Reputation: 67075

My brain is too fried to come up with the actual query for this at the moment, but I might be able to give you a nudge in a path that should work :)

What if you did add another column (maybe a table of valid duplicated users instead?...both will accomplish the same thing), and ran a subquery that would count up all of the valid duplicates and then you could compare against the count in your current query. You would exclude any users that have matching counts, and would pull in any with counts that are higher. Hopefully that makes sense; I will create a use case:

  • Chris Baker with id 1 and 4 are marked as valid_duplicates
  • There are 4 Chris Bakers in the system
  • You get a count of valid Chris Bakers
  • You get a count of all Chris Bakers
  • valid_count <> total_count, so return Chris Baker

*You can probably even modify the query so that it does not list the valid duplicate ids at all (even if that leaves a row with only 1 id), so you don't have to re-check which ones are valid. This would be a little more complicated. Without it, you at least ignore Chris Baker until another one enters the system.

I have written up the basic query; dealing with excluding specific ids is something I will try to roll in tonight. But this at least solves your initial need. If you do not need the more complicated query, do let me know so that I do not waste my time on it :)

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
WHERE NOT EXISTS
    (
        SELECT 1 
        FROM
        (
            SELECT 
                CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
                COUNT(*) AS "valid_duplicate_count" 
            FROM 
                users
            WHERE 
                is_valid_duplicate = 1 -- true
            GROUP BY 
               name 
            HAVING 
               valid_duplicate_count > 1 
        ) AS duplicate_users
        WHERE 
            duplicate_users.name = users.name 
                AND valid_duplicate_count = duplicate_count
    )    
GROUP BY 
    name 
HAVING 
    duplicate_count > 1

Below is the query that should do the same as above, but the final list will only print the ids that are not in the valid list. This actually ended up being a lot simpler than I thought. It is mostly the same as above; the only reason I kept the query above is to keep the two options, and in case I messed it up... it does get complicated as it is many nested queries. If CTEs or even temp tables are available to you, it might make the query more expressive to break it up that way :). Hopefully this helps and is what you are looking for.

SELECT GROUP_CONCAT(id) AS "ids", 
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "final_duplicate_count" 
    -- This count could actually be 1 due to the nature of the query
FROM 
    users
-- get the list of duplicated user names
WHERE EXISTS
    (
        SELECT 
            CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
            COUNT(*) AS "total_duplicate_count"
        FROM 
            users AS total_dup_users
        -- ignore valid_users whose count still matches
        WHERE NOT EXISTS
            (
                SELECT 1 
                FROM
                (
                    SELECT 
                        CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
                        COUNT(*) AS "valid_duplicate_count" 
                    FROM 
                        users AS valid_users
                    WHERE 
                        is_valid_duplicate = 1 -- true
                    GROUP BY 
                        name 
                    HAVING 
                        valid_duplicate_count > 1 
                ) AS duplicate_users
                WHERE 
                    -- join inner table to outer table
                    duplicate_users.name = total_dup_users.name
                        -- valid count check
                        AND valid_duplicate_count = total_duplicate_count
            )
            -- join inner table to outer table
            AND total_dup_users.Name = users.Name 
        GROUP BY 
            name 
        HAVING 
            total_duplicate_count > 1
    )
    -- ignore users that are valid when doing the actual counts
    AND NOT EXISTS
    (
        SELECT 1
        FROM users AS valid
        WHERE 
            -- join inner table to outer table
            users.name = 
                CONCAT(UPPER(valid.first_name), UPPER(valid.last_name))
            -- only valid users
            AND valid.is_valid_duplicate = 1 -- true
    )
GROUP BY 
    name

Upvotes: 2

cyrusv

Reputation: 247

I would go ahead and make the "confirmed_unique" column, defaulted as "False."

In order to avoid the problems you mentioned, I would then select all elements that look like duplicates and have a "False" entry for "confirmed_unique."
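One way to read that suggestion as a query, assuming the users table from the question and treating "looks like a duplicate" as "shares a first and last name with some other row":

```sql
-- Assumption: the flag is stored as TINYINT(1), 0 = False, 1 = True.
ALTER TABLE users ADD COLUMN confirmed_unique TINYINT(1) NOT NULL DEFAULT 0;

-- Unconfirmed rows that share a name with at least one other row.
SELECT u.id, u.first_name, u.last_name
FROM users u
WHERE u.confirmed_unique = 0
  AND EXISTS (
      SELECT 1
      FROM users o
      WHERE o.id <> u.id
        AND UPPER(o.first_name) = UPPER(u.first_name)
        AND UPPER(o.last_name) = UPPER(u.last_name)
  );
```

Once ids 1 and 4 are flagged, neither is listed again, but a newly added Chris Baker is, because it is unconfirmed and still matches existing rows by name.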

Upvotes: 0
