Reputation: 7571
I'd like to understand why the following two queries, which select duplicates from the companies
table, have different execution times. The second query, with the JOIN,
executes much faster.
The queries are executed on:

- a `companies` table with ~200k records
- duplicate `name` column values (each query returns ~1500 results, namely the duplicates)
- a `nameindex` index on the `name` column

WHERE (0.31142875 s):
SELECT *
FROM `companies`
WHERE `name` IN (
SELECT `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
);
JOIN (0.07034850 s):
SELECT *
FROM `companies`
INNER JOIN (
SELECT `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
) AS `duplicate_names` ON `companies`.`name` = `duplicate_names`.`name`;
Note that the subqueries are exactly the same. Why is it that in this specific setup the second query is faster?
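As a sanity check that the two shapes are semantically equivalent, here is a minimal stand-in using Python's `sqlite3` (not MySQL, so timings and plans differ, but the result sets must match); the table, index, and company names are invented for illustration:

```python
import sqlite3

# Tiny stand-in schema mirroring the question: a `companies` table
# with an index on `name`, seeded with some duplicated names.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
con.execute("CREATE INDEX nameindex ON companies (name)")
con.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [("Acme",), ("Acme",), ("Globex",), ("Initech",), ("Initech",), ("Initech",)],
)

# Query 1: WHERE ... IN (subquery)
where_rows = con.execute("""
    SELECT * FROM companies
    WHERE name IN (SELECT name FROM companies
                   GROUP BY name HAVING COUNT(name) > 1)
    ORDER BY id""").fetchall()

# Query 2: INNER JOIN against the same subquery as a derived table
join_rows = con.execute("""
    SELECT companies.* FROM companies
    INNER JOIN (SELECT name FROM companies
                GROUP BY name HAVING COUNT(name) > 1) AS duplicate_names
            ON companies.name = duplicate_names.name
    ORDER BY id""").fetchall()

assert where_rows == join_rows  # identical result sets
print(len(where_rows))          # 5 rows: 2x Acme + 3x Initech, Globex excluded
```

The derived table holds one row per duplicated name, so the join produces exactly one match per base row and cannot multiply rows; the two queries differ only in how the optimizer is allowed to execute them.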
The output from EXPLAIN
is:
WHERE query:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | companies | NULL | ALL | NULL | NULL | NULL | NULL | 195258 | 100.00 | Using where |
2 | SUBQUERY | companies | NULL | index | nameindex | nameindex | 1022 | NULL | 195258 | 100.00 | Using index |
JOIN query:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | &lt;derived2&gt; | NULL | ALL | NULL | NULL | NULL | NULL | 195258 | 100.00 | NULL |
1 | PRIMARY | companies | NULL | ref | nameindex | nameindex | 1022 | duplicate_names.name | 1 | 100.00 | NULL |
2 | DERIVED | companies | NULL | index | nameindex | nameindex | 1022 | NULL | 195258 | 100.00 | Using index |
Upvotes: 0
Views: 94
Reputation: 5623
In my opinion, which execution plan is chosen, and which plan actually performs better, depends strongly on the selectivity of the index.
I created a test table with varying numbers of duplicates and compared the execution plans and query performance on it.
Results:
For a low share of duplicates (in the OP's table: ~1500/200000 ≈ 0.75%):
Total rows | Duplicates | Join | In |
---|---|---|---|
15000 | 62 (0.4%) | 0.01211000 | 0.02565750 |
15000 | 118 | 0.01559225 | 0.02836675 |
15000 | 133 | 0.01403475 | 0.02834625 |
15000 | 288 | 0.02084200 | 0.02859325 |
15000 | 1015 | 0.05302300 | 0.06872775 |
15000 | 1122 | 0.05069350 | 0.05169975 |
15000 | 2085 | 0.04678600 | 0.04377050 |
15000 | 3744 | 0.05224775 | 0.05224775 |
Plan for JOIN
-> Nested loop inner join (cost=12028 rows=14888) (actual time=10.8..11.7 rows=124 loops=1)
-> Table scan on duplicate_names (cost=3843..3950 rows=8410) (actual time=10.7..10.8 rows=62 loops=1)
-> Materialize (cost=3843..3843 rows=8410) (actual time=10.7..10.7 rows=62 loops=1)
-> Filter: (count(0) > 1) (cost=3002 rows=8410) (actual time=0.0488..10.7 rows=62 loops=1)
-> Group aggregate: count(0) (cost=3002 rows=8410) (actual time=0.045..9.72 rows=14938 loops=1)
-> Covering index scan on companies using ix_name (cost=1513 rows=14888) (actual time=0.0379..4.67 rows=15000 loops=1)
-> Index lookup on companies using ix_name (name=duplicate_names.`name`) (cost=0.443 rows=1.77) (actual time=0.014..0.0152 rows=2 loops=62)
For IN
-> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2)) (cost=1513 rows=14888) (actual time=11.1..37.2 rows=124 loops=1)
-> Table scan on companies (cost=1513 rows=14888) (actual time=0.0463..6.53 rows=15000 loops=1)
-> Select #2 (subquery in condition; run only once)
-> Filter: ((companies.`name` = `<materialized_subquery>`.`name`)) (cost=3843..3843 rows=1) (actual time=0.00161..0.00161 rows=0.00827 loops=15001)
-> Limit: 1 row(s) (cost=3843..3843 rows=1) (actual time=0.00146..0.00146 rows=0.00827 loops=15001)
-> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`) (actual time=0.00132..0.00132 rows=0.00827 loops=15001)
-> Materialize with deduplication (cost=3843..3843 rows=8410) (actual time=10.9..10.9 rows=62 loops=1)
-> Filter: (count(0) > 1) (cost=3002 rows=8410) (actual time=0.0351..10.8 rows=62 loops=1)
-> Group aggregate: count(0) (cost=3002 rows=8410) (actual time=0.0314..9.89 rows=14938 loops=1)
-> Covering index scan on companies using ix_name (cost=1513 rows=14888) (actual time=0.0265..4.67 rows=15000 loops=1)
When the share of duplicate names is large, the plan is different.
JOIN
EXPLAIN
-> Nested loop inner join (cost=13500 rows=5000) (actual time=10.5..31.4 rows=4972 loops=1)
-> Table scan on companies (cost=500 rows=5000) (actual time=0.0496..4.7 rows=5000 loops=1)
-> Covering index lookup on duplicate_names using <auto_key0> (name=companies.`name`) (cost=1001..1003 rows=10) (actual time=0.0046..0.00508 rows=0.994 loops=5000)
-> Materialize (cost=1000..1000 rows=1) (actual time=10.3..10.3 rows=967 loops=1)
-> Filter: (count(0) > 1) (cost=1000 rows=1) (actual time=0.0435..5.78 rows=967 loops=1)
-> Group aggregate: count(0) (cost=1000 rows=1) (actual time=0.0396..5.51 rows=995 loops=1)
-> Covering index scan on companies using ix_name (cost=500 rows=5000) (actual time=0.0329..2.33 rows=5000 loops=1)
IN
EXPLAIN
-> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2)) (cost=500 rows=5000) (actual time=8.01..28 rows=4972 loops=1)
-> Table scan on companies (cost=500 rows=5000) (actual time=0.0621..3.37 rows=5000 loops=1)
-> Select #2 (subquery in condition; run only once)
-> Filter: ((companies.`name` = `<materialized_subquery>`.`name`)) (cost=1000..1000 rows=1) (actual time=0.00407..0.00407 rows=0.994 loops=4994)
-> Limit: 1 row(s) (cost=1000..1000 rows=1) (actual time=0.00313..0.00313 rows=0.994 loops=4994)
-> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`) (actual time=0.0029..0.0029 rows=0.994 loops=4994)
-> Materialize with deduplication (cost=1000..1000 rows=1) (actual time=7.79..7.79 rows=967 loops=1)
-> Filter: (count(0) > 1) (cost=1000 rows=1) (actual time=0.107..7.02 rows=967 loops=1)
-> Group aggregate: count(0) (cost=1000 rows=1) (actual time=0.103..6.82 rows=995 loops=1)
-> Covering index scan on companies using ix_name (cost=500 rows=5000) (actual time=0.0936..3.33 rows=5000 loops=1)
Total rows | Duplicates | Join | In |
---|---|---|---|
10000 | 3125 | 0.05110900 | 0.03762375 |
5000 | 1816 | 0.02856050 | 0.02632475 |
5000 | 1794 | 0.08704400 | 0.02750950 |
5000 | 1862 | 0.06159625 | 0.03077825 |
5000 | 2844 | 0.06492925 | 0.03758150 |
5000 | 4003 | 0.05222750 | 0.05222750 |
5000 | 4007 | 0.06000325 | 0.04444525 |
5000 | 4007 | 0.04093250 | 0.03153725 |
5000 | 4004 | 0.02923175 | 0.04555700 |
5000 | 4005 | 0.04226600 | 0.03515675 |
Query execution time also depends on other factors, especially on such a small table of 5000 rows. Nevertheless, the results confirm the assumption that which execution plan is preferable depends on the selectivity of the data.
Test table
create table companies (
  id int primary key auto_increment,
  name varchar(50) not null,
  Address varchar(100),
  LegalName varchar(100)
);
create index ix_name on companies (name);
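A benchmark like the one above needs a table with a controllable duplicate fraction. Here is one possible sketch of that data generation step, again using `sqlite3` as a stand-in for MySQL; `make_test_table` and the name patterns are hypothetical helpers, not from the answer:

```python
import random
import sqlite3

def make_test_table(con, total_rows, dup_fraction, seed=0):
    """Hypothetical helper: fill `companies` so that roughly
    `dup_fraction` of rows share their name with one other row."""
    rng = random.Random(seed)
    con.execute("DROP TABLE IF EXISTS companies")
    con.execute("""CREATE TABLE companies (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name VARCHAR(50) NOT NULL,
        Address VARCHAR(100),
        LegalName VARCHAR(100))""")
    con.execute("CREATE INDEX ix_name ON companies (name)")
    n_dup_rows = int(total_rows * dup_fraction)
    # Duplicated names appear exactly twice each.
    dup_names = ["dup_%06d" % i for i in range(n_dup_rows // 2)]
    names = dup_names * 2
    # Unique names fill the remainder of the table.
    names += ["uniq_%06d" % i for i in range(total_rows - len(names))]
    rng.shuffle(names)
    con.executemany("INSERT INTO companies (name) VALUES (?)",
                    [(n,) for n in names])
    con.commit()

con = sqlite3.connect(":memory:")
make_test_table(con, 15000, 0.004)  # ~0.4% duplicates, like the first row above
dups, = con.execute("""SELECT COUNT(*) FROM companies
    WHERE name IN (SELECT name FROM companies
                   GROUP BY name HAVING COUNT(name) > 1)""").fetchone()
print(dups)  # 60 duplicate rows out of 15000
```

Varying `dup_fraction` from under 1% up to ~80% reproduces the two regimes in the tables above: at low selectivity the materialized subquery stays small, while at high duplicate shares most base rows survive the filter and the plans converge.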
Upvotes: 0
Reputation: 31
It looks like subquery materialization would be more efficient for the 1st statement, but the optimizer chooses the SUBQUERY
execution strategy.
Have the statistics been collected? If not, try collecting them with ANALYZE TABLE companies
and re-running the 1st statement.
If you're on MySQL 8.0 or later, you can use the optimizer hint:
SELECT *
FROM `companies`
WHERE `name` IN (
SELECT /*+ SUBQUERY(MATERIALIZATION) */ `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
);
Upvotes: 0