Reputation: 32090
I'm trying to find duplicates in my users
table (don't ask, it's a lot of red tape), but I'm having a problem creating an index for the query I've created. The table looks like:
+----------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+---------+------+-----+---------+----------------+
| id | int(10) | NO | PRI | NULL | auto_increment |
| email | text | YES | MUL | NULL | |
| username | text | YES | MUL | NULL | |
| password | text | YES | | NULL | |
+----------------+---------+------+-----+---------+----------------+
There are other fields, but these are what I am searching off of. The query I wrote for finding the duplicates looks like:
SELECT COUNT(username) count,GROUP_CONCAT(id) ids,username,email,password
FROM users
GROUP BY username,email,password
HAVING COUNT(username) > 1
And the index I created is:
CREATE INDEX users_id_username_password_email
ON users id,username(64),password(64),email(64));
Unfortunately, describe doesn't seem to be using this index:
mysql> describe SELECT COUNT(username) count,GROUP_CONCAT(id) ids,
-> username,email,password
-> FROM users
-> GROUP BY username,email,password
-> HAVING COUNT(username) > 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: users
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 705418
Extra: Using filesort
So the root question is, what kind of index should I be creating to find duplicate entries on a table such as this?
Edit: Changing the order of the query to match the index did nothing:
mysql> describe SELECT COUNT(username) count,GROUP_CONCAT(id) ids,
-> username,password,email
-> FROM users
-> GROUP BY username,password,email
-> HAVING COUNT(username) > 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: users
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 705418
Extra: Using filesort
Upvotes: 1
Views: 132
Reputation: 563011
There's no point for the RDBMS to use an index if it needs to read every row anyway. The presence of an index doesn't matter, the order of columns in the index doesn't matter, and it wouldn't even matter if you used FORCE INDEX.
By analogy, if I asked you to find every occurrence of the word "the" in a book, would you use the index at the back of the book, or would you just read it cover to cover?
Another way you can write the query is the following:
select t1.id, t2.id from users t1
join users t2 using (username,password,email)
where t1.id<t2.id
This results in the following explain plan:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
type: ALL
possible_keys: PRIMARY,users_id_username_password_email
key: NULL
key_len: NULL
ref: NULL
rows: 16516
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t2
type: ref
possible_keys: PRIMARY,users_id_username_password_email
key: users_id_username_password_email
key_len: 201
ref: test.t1.username,test.t1.password,test.t1.email
rows: 82
Extra: Using where
It still does one table-scan of the users table, but it doesn't have to sort the whole table to find the duplicates. It just has to do a key lookup.
For what it's worth, I tested with an index only on (username(64),email(64),password(64))
. There's no need to include id in the index, because all InnoDB indexes include the primary key column implicitly.
Here's another query that relies on joins to reduce the result set, and then groups by the smallest id and shows the higher id's that are dupes. Optionally you can also return the columns by which you joined.
select t1.id, /* t1.username, t1.password, t1.email, */ group_concat(t2.id) as dupes
from users t1
join users t2
on (t1.username,t1.password,t1.email) = (t2.username,t2.password,t2.email)
and t1.id < t2.id
left outer join users t3
on (t1.username,t1.password,t1.email) = (t3.username,t3.password,t3.email)
and t1.id > t3.id
where t3.id is null
group by t1.id;
Upvotes: 2