How can I speed up a Group By statement with multiple Joins?

Question

I am having trouble trying to speed up a query that's taking about 11 seconds on only 2 million rows. Here is a link to my sqlfiddle. And here is the statement I am trying to run and my EXPLAIN statement.

The Query:

SELECT crawl.pk Pk,domains.domain Domain, 
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, 
crawl.redirect Redirect FROM crawl 
LEFT JOIN dates ON crawl.date_crawled=dates.pk     
LEFT JOIN schemes ON crawl.scheme=schemes.pk 
LEFT JOIN domains ON crawl.domain=domains.pk 
LEFT JOIN remainders ON crawl.remainder=remainders.pk 
WHERE (dates.date < CURDATE() - INTERVAL 30 DAY) 
AND crawl.redirect=0 
GROUP BY crawl.domain 
ORDER BY crawl.date_crawled ASC 
LIMIT 50

EXPLAIN:

+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
| id | select_type | table      | type   | possible_keys         | key                   | key_len | ref                        | rows   | Extra                                        |
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | dates      | ALL    | PRIMARY,date          | NULL                  | NULL    | NULL                       |      7 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | crawl      | ref    | date_crawled_redirect | date_crawled_redirect | 8       | mytable.dates.pk,const     | 408644 |                                              |
|  1 | SIMPLE      | schemes    | eq_ref | PRIMARY               | PRIMARY               | 4       | mytable.crawl.scheme       |      1 |                                              |
|  1 | SIMPLE      | domains    | eq_ref | PRIMARY               | PRIMARY               | 4       | mytable.crawl.domain       |      1 |                                              |
|  1 | SIMPLE      | remainders | eq_ref | PRIMARY               | PRIMARY               | 4       | mytable.crawl.remainder    |      1 |                                              |
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
5 rows in set (2.26 sec)

EDIT #1: As per the comments I've replaced the Left Joins w/ Joins and moved the date filter by the join. This did not reduce the query time, sadly.

SELECT crawl.pk Pk,domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, crawl.redirect Redirect
FROM crawl
JOIN schemes ON crawl.scheme=schemes.pk
JOIN domains ON crawl.domain=domains.pk
JOIN remainders ON crawl.remainder=remainders.pk
JOIN dates ON crawl.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE crawl.redirect=0
GROUP BY crawl.domain
ORDER BY crawl.date_crawled ASC
LIMIT 50

EDIT #2: My updated Explain:

+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
| id | select_type | table      | type   | possible_keys                                           | key                   | key_len | ref                        | rows   | Extra                                                     |
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
|  1 | SIMPLE      | dates      | range  | PRIMARY,date,date_pk,dateBtreeIdx,pk                     | date_pk                | 3       | NULL                       |      4 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | crawl      | ref    | domain_remainder,remainder,scheme,date_crawled_redirect | date_crawled_redirect | 8       | mytable.dates.pk,const     | 408644 |                                                           |
|  1 | SIMPLE      | schemes    | ALL    | PRIMARY                                                 | NULL                  | NULL    | NULL                       |      2 | Using where; Using join buffer                            |
|  1 | SIMPLE      | domains    | eq_ref | PRIMARY                                                 | PRIMARY               | 4       | mytable.crawl.domain       |      1 |                                                           |
|  1 | SIMPLE      | remainders | eq_ref | PRIMARY                                                 | PRIMARY               | 4       | mytable.crawl.remainder    |      1 |                                                           |
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+

EDIT #3

+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
| id | select_type        | table      | type            | possible_keys                            | key     | key_len | ref                        | rows    | Extra                           |
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
|  1 | PRIMARY            | schemes    | ALL             | PRIMARY                                  | NULL    | NULL    | NULL                       |       2 | Using temporary; Using filesort |
|  1 | PRIMARY            | crawl      | ref             | domain_remainder,remainder,scheme,domain | scheme  | 4       | mytable.schemes.pk         | 1448223 | Using where                     |
|  1 | PRIMARY            | domains    | eq_ref          | PRIMARY                                  | PRIMARY | 4       | mytable.crawl.domain       |       1 |                                 |
|  1 | PRIMARY            | remainders | eq_ref          | PRIMARY                                  | PRIMARY | 4       | mytable.crawl.remainder    |       1 |                                 |
|  2 | DEPENDENT SUBQUERY | dates      | unique_subquery | PRIMARY,date,date_pk,dateBtreeIdx,pk     | PRIMARY | 4       | func                       |       1 | Using where                     |
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
5 rows in set (0.04 sec)

EDIT #4:

+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
| id | select_type | table      | type   | possible_keys                        | key                     | key_len | ref                        | rows    | Extra                                                     |
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
|  1 | SIMPLE      | dates      | range  | PRIMARY,date,date_pk,dateBtreeIdx,pk | date_pk                 | 3       | NULL                       |       4 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | schemes    | ALL    | PRIMARY                              | NULL                    | NULL    | NULL                       |       2 | Using join buffer                                         |
|  1 | SIMPLE      | crawl      | ref    | scheme_domain_remainder              | scheme_domain_remainder | 4       | mytable.schemes.pk         | 1455517 | Using where                                               |
|  1 | SIMPLE      | domains    | eq_ref | PRIMARY                              | PRIMARY                 | 4       | mytable.crawl.domain       |       1 |                                                           |
|  1 | SIMPLE      | remainders | eq_ref | PRIMARY                              | PRIMARY                 | 4       | mytable.crawl.remainder    |       1 |                                                           |
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
5 rows in set (0.04 sec)

EDIT #5

SELECT urls.pk PK, domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, urls.redirect Redirect, urls.date_crawled DC FROM
(SELECT * FROM (
 SELECT * FROM crawl as urls ORDER BY date_crawled ASC
) AS tmp GROUP BY tmp.domain ) as urls
JOIN schemes ON urls.scheme=schemes.pk
JOIN domains ON urls.domain=domains.pk
JOIN remainders ON urls.remainder=remainders.pk
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE urls.redirect=0
ORDER BY urls.date_crawled ASC
LIMIT 50

Benvorth · Accepted Answer

You have a nearly optimal query at hand. The only problem results from a not-optimal index in table dates. As you can see in your EXPLAIN output MySQL is not able to use any index in table dates so it is used as first table. This leads to a semi-optimal execution plan for your table crawl with a huge number of lines to be accessed.

To improve this you should add an BTREE index on your dates.date column:

ALTER TABLE dates ADD INDEX dateBtreeIdx USING BTREE (date)

BTREE-indices are used for range-conditions. In your case the "lower than", see here.

Based on that you can try to add the join-field Dates.pk to the index as well. This might Speed up your query further, but depends on your data.

Edit

Now MySQL can use the Index on date.dates (type = RANGE and rows = 4). You don't see a speedup because now the optimizer wont use the PRIMARY KEY in schemes...

But the big Performance-issue stays with crawl. Try a different Approach with IN queries:

SELECT 
    crawl.pk Pk, domains.domain Domain, 
    CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, 
    crawl.redirect Redirect 
FROM 
    crawl, schemes, domains, remainders
WHERE 
    crawl.scheme=schemes.pk
    AND crawl.domain=domains.pk
    AND crawl.remainder=remainders.pk

    AND crawl.date_crawled IN (SELECT pk FROM dates WHERE (dates.date < CURDATE() - INTERVAL 30 DAY))
    AND crawl.redirect=0 

GROUP BY 
    crawl.domain 
ORDER BY 
    crawl.date_crawled ASC 
LIMIT 50

Edit #2

SELECT 
    urls.pk PK, domains.domain Domain, 
    CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, 
    urls.redirect Redirect, 
    urls.date_crawled DC 
FROM 
    (SELECT pk, redirect, date_crawled FROM crawl GROUP BY `domain` ) as urls
JOIN schemes ON urls.scheme=schemes.pk
JOIN domains ON urls.`domain`=domains.pk
JOIN remainders ON urls.remainder=remainders.pk
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE 
    urls.redirect=0
ORDER BY urls.date_crawled ASC
LIMIT 50

How can I speed up a Group By statement with multiple Joins?

Answers (1)

Related Questions