Reputation: 16333
I have the following two tables in MySQL (Simplified).
clicks
(InnoDB)
date_added
columnlink_id
which refers to a record in the links
tablelinks
(MyISAM)
I'm trying to run some analytical queries using these tables. I need to pull out some data, about clicks that occurred inside of two specified dates while applying some other user selected filters using other tables and joining them into the links table.
My question revolves around the use of indexes however. When I run the following query:
SELECT
COUNT(1)
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-11-03 23:59:59';
I get a response back in 1.40 sec. Using EXPLAIN
I find that the MySQL uses the index on the date_added
column as expected.
EXPLAIN SELECT COUNT(1) FROM clicks WHERE date_added >= '2016-11-01 00:00:00' AND date_added <= '2016-11-16 23:59:59';
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| 1 | SIMPLE | clicks | range | date_added | date_added | 4 | NULL | 1559288 | Using where; Using index |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
However, when I LEFT JOIN
in my links
table I find that the query takes much longer to execute:
SELECT
COUNT(1) AS clicks
FROM
clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE
c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59';
Which completed in 6.50 sec. Using EXPLAIN
I find that the index was not used on the date_added
column:
EXPLAIN SELECT COUNT(1) AS clicks FROM clicks AS c LEFT JOIN links AS l ON l.id = c.link_id WHERE c.date_added >= '2016-11-01 00:00:00' AND c.date_added <= '2016-11-16 23:59:59';
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| 1 | SIMPLE | c | range | date_added | date_added | 4 | NULL | 6613278 | Using where |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | c.link_id | 1 | Using index |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
As you can see the index isn't being used for the date_added
column in the larger table and seems to take far longer. This seems to get even worse when I join in other tables.
Does anyone know why this is happening or if there's anything I can do to get it to use the index on the date_added
column in the clicks table?
Edit
I've just attempted to get my stats out of the database using a different method. The first step in my method involves pulling out a distinct set of link_id
s from the clicks table. I've found that I'm seeing the same problem here again, without a JOIN. The index is not being used:
My query:
SELECT
DISTINCT(link_id) AS link_id
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-12-05 10:16:00'
This query took almost a minute to complete. I ran an EXPLAIN
on this and found that the query is not using the index as I expected it would:
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| 1 | SIMPLE | clicks | index | date_added | link_id | 4 | NULL | 79786609 | Using where |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
I expected that it would use the index on date_added
to filter down the result set and then pull out the distinct link_id
values. Any idea why this is happening? I have an index on link_id
as well as date_added
.
Upvotes: 4
Views: 7695
Reputation: 108851
Do you want to use an ordinary JOIN
in place of the LEFT JOIN
? LEFT JOIN
preserves all the rows on the right, so it will yield the same value of COUNT()
as the unjoined table. If you want to count only the rows from your right-hand table that have matching rows in the left-hand table, use JOIN
, not LEFT JOIN
.
Try dropping your index on date_added
and replacing it with a compound index on (date_added, link_id)
. This sort of index is called a covering index. When the query planner knows it can get everything it needs from an index, it doesn't have to bounce back to the table. In this case the query planner can random-access the index to the beginning of your date range, then do an index range scan to the end of the range. It's still going to have to refer to the other table, though.
(Edit) For the sake of experimentation, try a narrower date range. See if EXPLAIN
changes. In that case, the query planner might be guessing your date_added column's cardinality wrong.
You might try an index hint. For example, try
SELECT COUNT(1) AS clicks
FROM clicks AS c USE INDEX (date_added)
LEFT JOIN links AS l ON l.id = c.link_id
WHERE etc
But, judging from your EXPLAIN
output, you're already doing a range scan on date_added
. Your next step, like it or not, is the compound covering index.
Make sure there's an index on links(id)
. There probably is, because it's probably the PK.
Try using COUNT(*)
instead of COUNT(1)
. It probably won't make a difference, but it's worth a try. COUNT(*)
simply counts rows rather than evaluating something for each row it counts.
(Nitpick) Your date range smells funny. Use <
for the end of your range for best results, like so.
WHERE c.date_added >= '2016-11-01'
AND c.date_added < '2016-11-17';
Edit: Look, the MySQL query planner uses lots of internal knowledge about how tables are structured. And, it can only use one index per table to satisfy a query as of late 2016. That's a limitation.
SELECT DISTINCT column
is actually a fairly complex query, because it has to de-dupe the column
in question. If there's an index on that column, the query planner is likely to use it. Choosing that index means it could not choose some other index.
Compound indexes (covering indexes) sometimes but not always resolve this kind of index-selection dilemma, and allow index dual usage. You can read about all this at http://use-the-index-luke.com/
But if your operational constraints prevent the adding of compound indexes, you'll need to live with the one-second query. It isn't that bad.
Of course, saying you can't add compound indexes to get your job done is like this:
A: stuff is falling off my truck on the freeway.
B: put a tarp over the stuff and tie it down.
A: my boss won't let me put a tarp on the truck.
B: well, then, drive slow.
Upvotes: 2
Reputation: 77926
Not absolutely sure but consider moving the condition from WHERE
condition to JOIN ON
condition since you are performing a outer join (LEFT JOIN
) it makes difference in performance unlike inner join
where the condition be it on where
or join on
clause is equivalent.
SELECT COUNT(1) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
AND (c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59');
Upvotes: 1