chris

Reputation: 754

Distinct vs Group By

I have two tables like this. The 'order' table has 21886 rows.

CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=572375 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Here is my question.

MariaDB [test]> explain
    -> SELECT DISTINCT A.id
    -> FROM `order` A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows  | Extra                                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              | 22151 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |     2 | Using index; Distinct                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)

MariaDB [test]> explain
    -> SELECT A.id
    -> FROM `order` A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> GROUP BY A.id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows | Extra                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              |   65 | Using index; Using temporary |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |    2 | Using index                  |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+

The two queries above return the same result, but the DISTINCT version is much slower (EXPLAIN estimates far more rows). What's the difference?

Upvotes: 11

Views: 10475

Answers (2)

Paul Spiegel

Reputation: 31792

It is usually advised to use DISTINCT rather than GROUP BY when that is what you actually mean, and to let the optimizer choose the "best" execution plan. However, no optimizer is perfect. With DISTINCT the optimizer has more options for an execution plan, but that also means it has more options to pick a bad one.
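
To be explicit about the equivalence: without aggregate functions the two forms describe the same result set, so any speed difference comes purely from the plan the optimizer picks. A minimal illustration:

-- Semantically equivalent without aggregates; only the chosen plan may differ.
SELECT DISTINCT order_id FROM order_detail_products;
SELECT order_id FROM order_detail_products GROUP BY order_id;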

You write that the DISTINCT query is "slow", but you don't give any numbers. In my test (with 10 times as many rows, on MariaDB 10.0.19 and 10.3.13) the DISTINCT query is only about 25% slower (562 ms vs. 453 ms). The EXPLAIN result is no help at all. It's even "lying": with LIMIT 100, 30 it would need to read at least 130 rows (that is what my EXPLAIN actually shows for GROUP BY), but it shows you 65.
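
On MariaDB 10.1 and later you can get past the misleading estimates with ANALYZE instead of EXPLAIN: it actually executes the statement and reports the real row counts (r_rows, r_filtered) next to the optimizer's estimates, so you can see how far off "rows" really is. A sketch against the schema above:

-- MariaDB 10.1+: ANALYZE runs the statement and adds actual row counts
-- (r_rows) next to the optimizer's estimates.
ANALYZE
SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 100, 30;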

I can't explain the 25% difference in execution time, but it seems the engine does a full table/index scan in either case, and has to sort the result before it can skip 100 rows and deliver 30.

The best plan would probably be:

  • Read rows from idx_reg_date index (table A) one by one in descending order
  • Look if there is a match in the idx_order_id index (table B)
  • Skip 100 matching rows
  • Send 30 matching rows
  • Exit

If roughly 10% of the rows in A have no match in B, this plan would only need to read something like 143 rows from A (about 10% more than the 130 matching rows it has to find).

The best I could do to more or less force this plan is:

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

This query returns the same result in 156 ms (three times faster than GROUP BY). But that is still too slow, and it's probably still reading all rows in table A.
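
One way to check that (a standard MariaDB/MySQL technique, not part of my timing test above) is to look at the session handler counters: reset them, run the query, and see how many index reads the engine actually performed. If Handler_read_% ends up in the hundreds of thousands, it really did walk (almost) the whole index.

FLUSH STATUS;   -- reset the session status counters

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;

-- Handler_read_key / Handler_read_next show how many index reads
-- the storage engine performed for the statement above.
SHOW SESSION STATUS LIKE 'Handler_read%';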

We can prove that a better plan exists with a "little" subquery trick:

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

This query executes in "no time" (~0 ms) and returns the same result on my test data. It's not 100% reliable, because the inner LIMIT 1000 has to contain at least 130 orders with a match in B, but it shows that the optimizer is not doing a good job here.
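
If you want to use the trick anyway, a slightly more defensive variant (just a sketch; the factor of 10 is an assumption you would tune to your own match rate) makes the inner window a generous multiple of OFFSET + LIMIT:

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1300   -- ~10 x (100 + 30); assumed safety factor, tune to your data
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100;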

So what are my conclusions:

  • The optimizer does not always do the best job and sometimes needs help
  • Even when we know "the best plan", we cannot always enforce it
  • DISTINCT is not always faster than GROUP BY
  • When no index can be used for all clauses, things get quite tricky

Test schema and dummy data:

drop table if exists `order`;
CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- cross join of two information_schema tables generates enough rows;
-- rand(1) is seeded so the dummy data is reproducible
insert into `order`(reg_date)
    select from_unixtime(floor(rand(1) * 1000000000)) as reg_date
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 218860;

drop table if exists `order_detail_products`;
CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- on average two detail rows per order; rand(2) is seeded for reproducibility
insert into order_detail_products(id, order_id, order_detail_id, prod_id)
    select null as id
    , floor(rand(2)*218860)+1 as order_id
    , 0 as order_detail_id
    , 0 as prod_id
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 437320;

Queries:

SELECT DISTINCT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 562 ms

SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 453 ms

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 156 ms

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- ~ 0 ms

Upvotes: 3

user5509289

Reputation:

I believe your SELECT DISTINCT is slow because the join to another table breaks the use of the index. In most cases SELECT DISTINCT will be faster, but here, since you are matching on columns of another table, the index can no longer be used efficiently and the query is much slower.

Upvotes: 0
