Reputation: 754
I have two tables like this. The 'order' table has 21886 rows.
CREATE TABLE `order` (
`id` bigint(20) unsigned NOT NULL,
`reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `idx_reg_date` (`reg_date`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `order_detail_products` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`order_id` bigint(20) unsigned NOT NULL,
`order_detail_id` int(11) NOT NULL,
`prod_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=572375 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
My question is here.
MariaDB [test]> explain
-> SELECT DISTINCT A.id
-> FROM order A
-> JOIN order_detail_products B ON A.id = B.order_id
-> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| 1 | SIMPLE | A | index | PRIMARY | idx_reg_date | 8 | NULL | 22151 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | B | ref | idx_order_id | idx_order_id | 8 | bom_20140804.A.id | 2 | Using index; Distinct |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
MariaDB [test]> explain
-> SELECT A.id
-> FROM order A
-> JOIN order_detail_products B ON A.id = B.order_id
-> GROUP BY A.id
-> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| 1 | SIMPLE | A | index | PRIMARY | idx_reg_date | 8 | NULL | 65 | Using index; Using temporary |
| 1 | SIMPLE | B | ref | idx_order_id | idx_order_id | 8 | bom_20140804.A.id | 2 | Using index |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
Listed above, two queries returns same result but distinct is too slow(explain too many rows). What's the difference?
Upvotes: 11
Views: 10475
Reputation: 31792
It is usually advised to use DISTINCT
instead of GROUP BY
, since that is what you actually want, and let the optimizer choose the "best" execution plan. However - no optimizer is perfect. Using DISTINCT
the optimizer can have more options for an execution plan. But that also means that it has more options to choose a bad plan.
You write that the DISTINCT
query is "slow", but you don't tell any numbers. In my test (with 10 times as many rows on MariaDB 10.0.19 and 10.3.13) the DISTINCT
query is like (only) 25% slower (562ms/453ms). The EXPLAIN
result is no help at all. It's even "lying". With LIMIT 100, 30
it would need to read at least 130 rows (that's what my EXPLAIN
actually schows for GROUP BY
), but it shows you 65.
I can't explain the 25% difference in execution time, but it seems that the engine is doing a full table/index scan in any case, and sorts the result before it can skip 100 and select 30 rows.
The best plan would probably be:
idx_reg_date
index (table A
) one by one in descending orderidx_order_id
index (table B
)If there are like 10% of rows in A
which have no match in B
, this plan would read something like 143 rows from A
.
Best I could do to somehow force this plan is:
SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100
This query returns the same result in 156 ms (3 times faster than GROUP BY
). But that is still too slow. And it's probaly still reading all rows in table A
.
We can proof that a better plan can exist with a "little" subquery trick:
SELECT A.id
FROM (
SELECT id, reg_date
FROM `order`
ORDER BY reg_date DESC
LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100
This query executes in "no time" (~ 0 ms) and returns the same result on my test data. And though it's not 100% reliable, it shows that the optimizer is not doing a good job.
So what are my conclusions:
DISTINCT
is not always faster than GROUP BY
drop table if exists `order`;
CREATE TABLE `order` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into `order`(reg_date)
select from_unixtime(floor(rand(1) * 1000000000)) as reg_date
from information_schema.COLUMNS a
, information_schema.COLUMNS b
limit 218860;
drop table if exists `order_detail_products`;
CREATE TABLE `order_detail_products` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`order_id` bigint(20) unsigned NOT NULL,
`order_detail_id` int(11) NOT NULL,
`prod_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into order_detail_products(id, order_id, order_detail_id, prod_id)
select null as id
, floor(rand(2)*218860)+1 as order_id
, 0 as order_detail_id
, 0 as prod_id
from information_schema.COLUMNS a
, information_schema.COLUMNS b
limit 437320;
SELECT DISTINCT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 562 ms
SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 453 ms
SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 156 ms
SELECT A.id
FROM (
SELECT id, reg_date
FROM `order`
ORDER BY reg_date DESC
LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- ~ 0 ms
Upvotes: 3
Reputation:
I believe your select distinct is slow because you broke the index by matching on another table. In most cases, select distinct will be faster. But in this case, since you are matching on parameters of another table, the index is broken and is much slower.
Upvotes: 0