Boann
Boann

Reputation: 50031

Performance difference between DISTINCT and GROUP BY

My understanding is that in (My)SQL a SELECT DISTINCT should do the same thing as a GROUP BY on all columns, except that GROUP BY does implicit sorting, so these two queries should be the same:

SELECT boardID,threadID FROM posts GROUP BY boardID,threadID ORDER BY NULL LIMIT 100;
SELECT DISTINCT boardID,threadID FROM posts LIMIT 100;

They're both giving me the same results, and they're giving identical output from EXPLAIN:

+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
|  1 | SIMPLE      | posts | ALL  | NULL          | NULL | NULL    | NULL | 1263320 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
1 row in set

But on my table the query with DISTINCT consistently returns instantly and the one with GROUP BY takes about 4 seconds. I've disabled the query cache to test this.

There's 25 columns so I've also tried creating a separate table containing only the boardID and threadID columns, but the same problem and performance difference persists.

I have to use GROUP BY instead of DISTINCT so I can include additional columns without them being included in the evaluation of DISTINCT. So now I don't how to proceed. Why is there a difference?

Upvotes: 2

Views: 4459

Answers (1)

mvp
mvp

Reputation: 116167

First of all, your queries are not quite the same - GROUP BY has ORDER BY, but DISTINCT does not.

Note, that in either case, index is NOT used, and that cannot be good for performance.

I would suggest creating compound index for (boardid, threadid) - this should let both queries to make use of index and both should start working much faster

EDIT: Explanation why SELECT DISTINCT ... LIMIT 100 is faster than GROUP BY ... LIMIT 100 when you do not have indexes.

To execute first statement (SELECT DISTINCT) server only needs to fetch 100, maybe slightly more rows and can stop as soon as it has 100 different rows - no more work to do. This is because original SQL statement did not specify any order, so server can deliver any 100 rows as it pleases, as long as they are distinct. But, if you were to impose any index-less ORDER BY on this before LIMIT 100, this query will immediately become slow.

To execute second statement (SELECT ... GROUP BY ... LIMIT 100), MySQL always does implicit ORDER BY by the same columns as were used in GROUP BY. In other words, it cannot quickly stop after fetching first few 100+ rows until all records are fetched, groupped and sorted. After that, it applies ORDER BY NULL you added (which does not do much I guess, but dropping it may speed things up), and finally, it gets first 100 rows and throws away remaining result. And of course, this is damn slow.

When you have compound index, all these steps can be done very quickly in either case.

Upvotes: 3

Related Questions