Flame

Reputation: 7571

MySQL Where vs INNER JOIN performance

I'd like to understand why the following two queries, which select duplicate rows from the companies table, have different execution times. The second query, using a JOIN, runs much faster.

The query is executed on:

WHERE (0.31142875 s):

SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);

JOIN (0.07034850 s):

SELECT *
FROM `companies`
INNER JOIN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
) AS `duplicate_names` ON `companies`.`name` = `duplicate_names`.`name`;  

Note that the subqueries are exactly the same. Why is the second query faster in this specific setup?

The output from EXPLAIN is:

WHERE query:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY companies NULL ALL NULL NULL NULL NULL 195258 100.00 Using where
2 SUBQUERY companies NULL index nameindex nameindex 1022 NULL 195258 100.00 Using index

JOIN query:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 195258 100.00 NULL
1 PRIMARY companies NULL ref nameindex nameindex 1022 duplicate_names.name 1 100.00 NULL
2 DERIVED companies NULL index nameindex nameindex 1022 NULL 195258 100.00 Using index
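
For reference, EXPLAIN ANALYZE (available from MySQL 8.0.18) gives tree-format plans with per-step actual timings; a sketch for the first query:

-- Runs the query and reports actual times and row counts per plan step.
EXPLAIN ANALYZE
SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);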

Upvotes: 0

Views: 94

Answers (2)

ValNik
ValNik

Reputation: 5623

In my opinion, the optimizer's choice of execution plan, and which plan turns out to be faster, depends strongly on the selectivity of the index.
I created a test table with varying numbers of duplicates and compared the execution plans and query performance on it.

Results:
For a low share of duplicates (in the OP's table about 1500/300000 = 0.5%):

Total rows   Duplicates    JOIN (s)      IN (s)
15000        62 (0.4%)     0.01211000    0.02565750
15000        118           0.01559225    0.02836675
15000        133           0.01403475    0.02834625
15000        288           0.02084200    0.02859325
15000        1015          0.05302300    0.06872775
15000        1122          0.05069350    0.05169975
15000        2085          0.04678600    0.04377050
15000        3744          0.05224775    0.05224775

Plan for JOIN

-> Nested loop inner join  (cost=12028 rows=14888) (actual time=10.8..11.7 rows=124 loops=1)
    -> Table scan on duplicate_names  (cost=3843..3950 rows=8410) (actual time=10.7..10.8 rows=62 loops=1)
        -> Materialize  (cost=3843..3843 rows=8410) (actual time=10.7..10.7 rows=62 loops=1)
            -> Filter: (count(0) > 1)  (cost=3002 rows=8410) (actual time=0.0488..10.7 rows=62 loops=1)
                -> Group aggregate: count(0)  (cost=3002 rows=8410) (actual time=0.045..9.72 rows=14938 loops=1)
                    -> Covering index scan on companies using ix_name  (cost=1513 rows=14888) (actual time=0.0379..4.67 rows=15000 loops=1)
    -> Index lookup on companies using ix_name (name=duplicate_names.`name`)  (cost=0.443 rows=1.77) (actual time=0.014..0.0152 rows=2 loops=62)

Plan for IN

-> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2))  (cost=1513 rows=14888) (actual time=11.1..37.2 rows=124 loops=1)
    -> Table scan on companies  (cost=1513 rows=14888) (actual time=0.0463..6.53 rows=15000 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((companies.`name` = `<materialized_subquery>`.`name`))  (cost=3843..3843 rows=1) (actual time=0.00161..0.00161 rows=0.00827 loops=15001)
            -> Limit: 1 row(s)  (cost=3843..3843 rows=1) (actual time=0.00146..0.00146 rows=0.00827 loops=15001)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`)  (actual time=0.00132..0.00132 rows=0.00827 loops=15001)
                    -> Materialize with deduplication  (cost=3843..3843 rows=8410) (actual time=10.9..10.9 rows=62 loops=1)
                        -> Filter: (count(0) > 1)  (cost=3002 rows=8410) (actual time=0.0351..10.8 rows=62 loops=1)
                            -> Group aggregate: count(0)  (cost=3002 rows=8410) (actual time=0.0314..9.89 rows=14938 loops=1)
                                -> Covering index scan on companies using ix_name  (cost=1513 rows=14888) (actual time=0.0265..4.67 rows=15000 loops=1)

In the plans above, the JOIN scans the materialized duplicate_names list once (62 rows) and does one index lookup on companies per duplicate name, while the IN plan probes the materialized subquery once per scanned row of companies (loops=15001). When the share of duplicate names is large, the plan is different.

JOIN

EXPLAIN
-> Nested loop inner join  (cost=13500 rows=5000) (actual time=10.5..31.4 rows=4972 loops=1)
    -> Table scan on companies  (cost=500 rows=5000) (actual time=0.0496..4.7 rows=5000 loops=1)
    -> Covering index lookup on duplicate_names using <auto_key0> (name=companies.`name`)  (cost=1001..1003 rows=10) (actual time=0.0046..0.00508 rows=0.994 loops=5000)
        -> Materialize  (cost=1000..1000 rows=1) (actual time=10.3..10.3 rows=967 loops=1)
            -> Filter: (count(0) > 1)  (cost=1000 rows=1) (actual time=0.0435..5.78 rows=967 loops=1)
                -> Group aggregate: count(0)  (cost=1000 rows=1) (actual time=0.0396..5.51 rows=995 loops=1)
                    -> Covering index scan on companies using ix_name  (cost=500 rows=5000) (actual time=0.0329..2.33 rows=5000 loops=1)

IN

EXPLAIN
-> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2))  (cost=500 rows=5000) (actual time=8.01..28 rows=4972 loops=1)
    -> Table scan on companies  (cost=500 rows=5000) (actual time=0.0621..3.37 rows=5000 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((companies.`name` = `<materialized_subquery>`.`name`))  (cost=1000..1000 rows=1) (actual time=0.00407..0.00407 rows=0.994 loops=4994)
            -> Limit: 1 row(s)  (cost=1000..1000 rows=1) (actual time=0.00313..0.00313 rows=0.994 loops=4994)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`)  (actual time=0.0029..0.0029 rows=0.994 loops=4994)
                    -> Materialize with deduplication  (cost=1000..1000 rows=1) (actual time=7.79..7.79 rows=967 loops=1)
                        -> Filter: (count(0) > 1)  (cost=1000 rows=1) (actual time=0.107..7.02 rows=967 loops=1)
                            -> Group aggregate: count(0)  (cost=1000 rows=1) (actual time=0.103..6.82 rows=995 loops=1)
                                -> Covering index scan on companies using ix_name  (cost=500 rows=5000) (actual time=0.0936..3.33 rows=5000 loops=1)

Total rows   Duplicates   JOIN (s)      IN (s)
10000        3125         0.05110900    0.03762375
5000         1816         0.02856050    0.02632475
5000         1794         0.08704400    0.02750950
5000         1862         0.06159625    0.03077825
5000         2844         0.06492925    0.03758150
5000         4003         0.05222750    0.05222750
5000         4007         0.06000325    0.04444525
5000         4007         0.04093250    0.03153725
5000         4004         0.02923175    0.04555700
5000         4005         0.04226600    0.03515675

Query execution time also depends on other factors, especially on such a small table of 5000 rows. Nevertheless, the results support the assumption that which execution plan is preferable depends on the selectivity of the data.
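
As a rough check on a table of your own, the share of rows involved in duplicate names can be measured with a query like this sketch (not part of the fiddle below):

-- Rows whose name occurs more than once, versus total rows.
SELECT
    (SELECT COALESCE(SUM(cnt), 0)
     FROM (SELECT COUNT(*) AS cnt
           FROM companies
           GROUP BY name
           HAVING COUNT(*) > 1) AS dup) AS duplicate_rows,
    COUNT(*) AS total_rows
FROM companies;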

Fiddle here

Test table

create table companies (
    id int primary key auto_increment,
    name varchar(50) not null,
    Address varchar(100),
    LegalName varchar(100)
);
create index ix_name on companies (name);
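
A minimal population sketch (not the script from the fiddle) showing how such a table could be filled with a controllable share of duplicate names, assuming MySQL 8.0 for the recursive CTE:

-- Allow the CTE below to generate 15000 rows (the default depth limit is 1000).
SET SESSION cte_max_recursion_depth = 20000;

INSERT INTO companies (name, Address, LegalName)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 15000
)
SELECT
    -- about 1% of rows share one of ten repeated names; the rest are unique
    CASE WHEN n % 100 = 0 THEN CONCAT('dup-', (n DIV 100) % 10)
         ELSE CONCAT('name-', n) END,
    CONCAT('Address ', n),
    CONCAT('Legal name ', n)
FROM seq;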

Upvotes: 0

Oleg Smirnov

Reputation: 31

It looks like subquery materialization would be more efficient for the first statement, but the optimizer chooses the SUBQUERY execution strategy (the second row of your EXPLAIN output shows select_type = SUBQUERY rather than MATERIALIZED).

Have table statistics been collected? If not, try collecting them with ANALYZE TABLE companies and re-running the first statement.
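
For example (a sketch; whether the plan actually changes depends on your data and statistics):

ANALYZE TABLE `companies`;

-- Re-check the chosen strategy; if materialization is picked, the subquery
-- row should show select_type = MATERIALIZED instead of SUBQUERY.
EXPLAIN
SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);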

If you're on MySQL 8.0 or later, you can also force materialization with the SUBQUERY optimizer hint:

SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT /*+ SUBQUERY(MATERIALIZATION) */ `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);

Upvotes: 0
