Jack M.
Jack M.

Reputation: 32090

Index for finding duplicate rows?

I'm trying to find duplicates in my users table (don't ask, it's a lot of red tape), but I'm having a problem creating an index for the query I've created. The table looks like:

+----------------+---------+------+-----+---------+----------------+
| Field          | Type    | Null | Key | Default | Extra          |
+----------------+---------+------+-----+---------+----------------+
| id             | int(10) | NO   | PRI | NULL    | auto_increment | 
| email          | text    | YES  | MUL | NULL    |                | 
| username       | text    | YES  | MUL | NULL    |                | 
| password       | text    | YES  |     | NULL    |                |
+----------------+---------+------+-----+---------+----------------+

There are other fields, but these are what I am searching off of. The query I wrote for finding the duplicates looks like:

SELECT COUNT(username) count,GROUP_CONCAT(id) ids,username,email,password
    FROM users
    GROUP BY username,email,password
    HAVING COUNT(username) > 1

And the index I created is:

CREATE INDEX users_id_username_password_email
    ON users id,username(64),password(64),email(64));

Unfortunately, describe doesn't seem to be using this index:

mysql> describe SELECT COUNT(username) count,GROUP_CONCAT(id) ids,
    -> username,email,password
    -> FROM users
    -> GROUP BY username,email,password
    -> HAVING COUNT(username) > 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: users
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 705418
        Extra: Using filesort

So the root question is, what kind of index should I be creating to find duplicate entries on a table such as this?

Edit: Changing the order of the query to match the index did nothing:

mysql> describe SELECT COUNT(username) count,GROUP_CONCAT(id) ids,
    -> username,password,email
    -> FROM users
    -> GROUP BY username,password,email
    -> HAVING COUNT(username) > 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: users
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 705418
        Extra: Using filesort

Upvotes: 1

Views: 132

Answers (1)

Bill Karwin
Bill Karwin

Reputation: 563011

There's no point for the RDBMS to use an index if it needs to read every row anyway. The presence of an index doesn't matter, the order of columns in the index doesn't matter, and it wouldn't even matter if you used FORCE INDEX.

By analogy, if I asked you to find every occurrence of the word "the" in a book, would you use the index at the back of the book, or would you just read it cover to cover?

Another way you can write the query is the following:

select t1.id, t2.id from users t1 
join users t2 using (username,password,email) 
where t1.id<t2.id

This results in the following explain plan:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: ALL
possible_keys: PRIMARY,users_id_username_password_email
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 16516
        Extra: 
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ref
possible_keys: PRIMARY,users_id_username_password_email
          key: users_id_username_password_email
      key_len: 201
          ref: test.t1.username,test.t1.password,test.t1.email
         rows: 82
        Extra: Using where

It still does one table-scan of the users table, but it doesn't have to sort the whole table to find the duplicates. It just has to do a key lookup.

For what it's worth, I tested with an index only on (username(64),email(64),password(64)). There's no need to include id in the index, because all InnoDB indexes include the primary key column implicitly.


Here's another query that relies on joins to reduce the result set, and then groups by the smallest id and shows the higher id's that are dupes. Optionally you can also return the columns by which you joined.

select t1.id, /* t1.username, t1.password, t1.email, */ group_concat(t2.id) as dupes
from users t1 
join users t2 
  on (t1.username,t1.password,t1.email) = (t2.username,t2.password,t2.email) 
  and t1.id < t2.id 
left outer join users t3
  on (t1.username,t1.password,t1.email) = (t3.username,t3.password,t3.email) 
  and t1.id > t3.id
where t3.id is null
group by t1.id;

Upvotes: 2

Related Questions