Reputation: 50031
My understanding is that in (My)SQL a SELECT DISTINCT
should do the same thing as a GROUP BY
on all columns, except that GROUP BY
does implicit sorting, so these two queries should be the same:
SELECT boardID,threadID FROM posts GROUP BY boardID,threadID ORDER BY NULL LIMIT 100;
SELECT DISTINCT boardID,threadID FROM posts LIMIT 100;
They're both giving me the same results, and they're giving identical output from EXPLAIN
:
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
| 1 | SIMPLE | posts | ALL | NULL | NULL | NULL | NULL | 1263320 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
1 row in set
But on my table the query with DISTINCT
consistently returns instantly and the one with GROUP BY
takes about 4 seconds. I've disabled the query cache to test this.
There's 25 columns so I've also tried creating a separate table containing only the boardID and threadID columns, but the same problem and performance difference persists.
I have to use GROUP BY
instead of DISTINCT
so I can include additional columns without them being included in the evaluation of DISTINCT
. So now I don't how to proceed. Why is there a difference?
Upvotes: 2
Views: 4459
Reputation: 116167
First of all, your queries are not quite the same - GROUP BY has ORDER BY, but DISTINCT does not.
Note, that in either case, index is NOT used, and that cannot be good for performance.
I would suggest creating compound index for (boardid, threadid)
- this should let both queries to make use of index and both should start working much faster
EDIT: Explanation why SELECT DISTINCT ... LIMIT 100
is faster than GROUP BY ... LIMIT 100
when you do not have indexes.
To execute first statement (SELECT DISTINCT
) server only needs to fetch 100, maybe slightly more rows and can stop as soon as it has 100 different rows - no more work to do.
This is because original SQL statement did not specify any order, so server can deliver any 100 rows as it pleases, as long as they are distinct. But, if you were to impose any index-less ORDER BY
on this before LIMIT 100
, this query will immediately become slow.
To execute second statement (SELECT ... GROUP BY ... LIMIT 100
), MySQL always does implicit ORDER BY
by the same columns as were used in GROUP BY. In other words, it cannot quickly stop after fetching first few 100+ rows until all records are fetched, groupped and sorted. After that, it applies ORDER BY NULL
you added (which does not do much I guess, but dropping it may speed things up), and finally, it gets first 100 rows and throws away remaining result. And of course, this is damn slow.
When you have compound index, all these steps can be done very quickly in either case.
Upvotes: 3