Reputation: 3908
I am having trouble trying to speed up a query that's taking about 11 seconds on only 2 million rows. Here is a link to my sqlfiddle. And here is the statement I am trying to run and my EXPLAIN statement.
The Query:
SELECT crawl.pk Pk,domains.domain Domain,
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri,
crawl.redirect Redirect FROM crawl
LEFT JOIN dates ON crawl.date_crawled=dates.pk
LEFT JOIN schemes ON crawl.scheme=schemes.pk
LEFT JOIN domains ON crawl.domain=domains.pk
LEFT JOIN remainders ON crawl.remainder=remainders.pk
WHERE (dates.date < CURDATE() - INTERVAL 30 DAY)
AND crawl.redirect=0
GROUP BY crawl.domain
ORDER BY crawl.date_crawled ASC
LIMIT 50
EXPLAIN:
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | dates | ALL | PRIMARY,date | NULL | NULL | NULL | 7 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | crawl | ref | date_crawled_redirect | date_crawled_redirect | 8 | mytable.dates.pk,const | 408644 | |
| 1 | SIMPLE | schemes | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.scheme | 1 | |
| 1 | SIMPLE | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | SIMPLE | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
5 rows in set (2.26 sec)
EDIT #1: As per the comments I've replaced the Left Joins w/ Joins and moved the date filter by the join. This did not reduce the query time, sadly.
SELECT crawl.pk Pk,domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, crawl.redirect Redirect
FROM crawl
JOIN schemes ON crawl.scheme=schemes.pk
JOIN domains ON crawl.domain=domains.pk
JOIN remainders ON crawl.remainder=remainders.pk
JOIN dates ON crawl.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE crawl.redirect=0
GROUP BY crawl.domain
ORDER BY crawl.date_crawled ASC
LIMIT 50
EDIT #2: My updated Explain:
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
| 1 | SIMPLE | dates | range | PRIMARY,date,date_pk,dateBtreeIdx,pk | date_pk | 3 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | crawl | ref | domain_remainder,remainder,scheme,date_crawled_redirect | date_crawled_redirect | 8 | mytable.dates.pk,const | 408644 | |
| 1 | SIMPLE | schemes | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using where; Using join buffer |
| 1 | SIMPLE | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | SIMPLE | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
EDIT #3
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
| 1 | PRIMARY | schemes | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using temporary; Using filesort |
| 1 | PRIMARY | crawl | ref | domain_remainder,remainder,scheme,domain | scheme | 4 | mytable.schemes.pk | 1448223 | Using where |
| 1 | PRIMARY | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | PRIMARY | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
| 2 | DEPENDENT SUBQUERY | dates | unique_subquery | PRIMARY,date,date_pk,dateBtreeIdx,pk | PRIMARY | 4 | func | 1 | Using where |
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
5 rows in set (0.04 sec)
EDIT #4:
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | dates | range | PRIMARY,date,date_pk,dateBtreeIdx,pk | date_pk | 3 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | schemes | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using join buffer |
| 1 | SIMPLE | crawl | ref | scheme_domain_remainder | scheme_domain_remainder | 4 | mytable.schemes.pk | 1455517 | Using where |
| 1 | SIMPLE | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | SIMPLE | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
5 rows in set (0.04 sec)
EDIT #5
SELECT urls.pk PK, domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, urls.redirect Redirect, urls.date_crawled DC FROM
(SELECT * FROM (
SELECT * FROM crawl as urls ORDER BY date_crawled ASC
) AS tmp GROUP BY tmp.domain ) as urls
JOIN schemes ON urls.scheme=schemes.pk
JOIN domains ON urls.domain=domains.pk
JOIN remainders ON urls.remainder=remainders.pk
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE urls.redirect=0
ORDER BY urls.date_crawled ASC
LIMIT 50
Upvotes: 3
Views: 111
Reputation: 7722
You have a nearly optimal query at hand. The only problem results from a not-optimal index in table dates
. As you can see in your EXPLAIN
output MySQL is not able to use any index in table dates
so it is used as first table. This leads to a semi-optimal execution plan for your table crawl
with a huge number of lines to be accessed.
To improve this you should add an BTREE
index on your dates.date
column:
ALTER TABLE dates ADD INDEX dateBtreeIdx USING BTREE (date)
BTREE-indices are used for range-conditions. In your case the "lower than", see here.
Based on that you can try to add the join-field Dates.pk
to the index as well. This might Speed up your query further, but depends on your data.
Edit
Now MySQL can use the Index on date.dates
(type = RANGE and rows = 4). You don't see a speedup because now the optimizer wont use the PRIMARY KEY
in schemes
...
But the big Performance-issue stays with crawl
. Try a different Approach with IN
queries:
SELECT
crawl.pk Pk, domains.domain Domain,
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri,
crawl.redirect Redirect
FROM
crawl, schemes, domains, remainders
WHERE
crawl.scheme=schemes.pk
AND crawl.domain=domains.pk
AND crawl.remainder=remainders.pk
AND crawl.date_crawled IN (SELECT pk FROM dates WHERE (dates.date < CURDATE() - INTERVAL 30 DAY))
AND crawl.redirect=0
GROUP BY
crawl.domain
ORDER BY
crawl.date_crawled ASC
LIMIT 50
Edit #2
SELECT
urls.pk PK, domains.domain Domain,
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri,
urls.redirect Redirect,
urls.date_crawled DC
FROM
(SELECT pk, redirect, date_crawled FROM crawl GROUP BY `domain` ) as urls
JOIN schemes ON urls.scheme=schemes.pk
JOIN domains ON urls.`domain`=domains.pk
JOIN remainders ON urls.remainder=remainders.pk
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE
urls.redirect=0
ORDER BY urls.date_crawled ASC
LIMIT 50
Upvotes: 2