mysqlsqlindexingdatabase-performancequery-performance

Reputation: 16333

MySQL query with JOIN not using INDEX

I have the following two tables in MySQL (Simplified).

clicks (InnoDB)
- Contains around about 70,000,000 records
- Has an index on the date_added column
- Has a column link_id which refers to a record in the links table
links (MyISAM)
- Contains far fewer records, around about 65,000

I'm trying to run some analytical queries using these tables. I need to pull out some data, about clicks that occurred inside of two specified dates while applying some other user selected filters using other tables and joining them into the links table.

My question revolves around the use of indexes however. When I run the following query:

SELECT
    COUNT(1)
FROM
    clicks
WHERE
    date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-11-03 23:59:59';

I get a response back in 1.40 sec. Using EXPLAIN I find that the MySQL uses the index on the date_added column as expected.

EXPLAIN SELECT COUNT(1) FROM clicks WHERE date_added >= '2016-11-01 00:00:00' AND date_added <= '2016-11-16 23:59:59';
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table  | type  | possible_keys | key        | key_len | ref  | rows    | Extra                    |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
|  1 | SIMPLE      | clicks | range | date_added    | date_added | 4       | NULL | 1559288 | Using where; Using index |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+

However, when I LEFT JOIN in my links table I find that the query takes much longer to execute:

SELECT
    COUNT(1) AS clicks
FROM
    clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE
    c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59';

Which completed in 6.50 sec. Using EXPLAIN I find that the index was not used on the date_added column:

EXPLAIN SELECT COUNT(1) AS clicks FROM clicks AS c LEFT JOIN links AS l ON l.id = c.link_id WHERE c.date_added >= '2016-11-01 00:00:00' AND c.date_added <= '2016-11-16 23:59:59';
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| id | select_type | table | type   | possible_keys | key        | key_len | ref           | rows    | Extra       |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
|  1 | SIMPLE      | c     | range  | date_added    | date_added | 4       | NULL          | 6613278 | Using where |
|  1 | SIMPLE      | l     | eq_ref | PRIMARY       | PRIMARY    | 4       | c.link_id     |       1 | Using index |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+

As you can see the index isn't being used for the date_added column in the larger table and seems to take far longer. This seems to get even worse when I join in other tables.

Does anyone know why this is happening or if there's anything I can do to get it to use the index on the date_added column in the clicks table?

Edit

I've just attempted to get my stats out of the database using a different method. The first step in my method involves pulling out a distinct set of link_ids from the clicks table. I've found that I'm seeing the same problem here again, without a JOIN. The index is not being used:

My query:

SELECT
    DISTINCT(link_id) AS link_id
FROM
    clicks
WHERE
    date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-12-05 10:16:00'

This query took almost a minute to complete. I ran an EXPLAIN on this and found that the query is not using the index as I expected it would:

+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | table   | type  | possible_keys | key      | key_len | ref  | rows     | Extra       |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
|  1 | SIMPLE      | clicks  | index | date_added    | link_id  | 4       | NULL | 79786609 | Using where |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+

I expected that it would use the index on date_added to filter down the result set and then pull out the distinct link_id values. Any idea why this is happening? I have an index on link_id as well as date_added.

Upvotes: 4

Answers (2)

O. Jones

Reputation: 108851

Do you want to use an ordinary JOIN in place of the LEFT JOIN? LEFT JOIN preserves all the rows on the right, so it will yield the same value of COUNT() as the unjoined table. If you want to count only the rows from your right-hand table that have matching rows in the left-hand table, use JOIN, not LEFT JOIN.

Try dropping your index on date_added and replacing it with a compound index on (date_added, link_id). This sort of index is called a covering index. When the query planner knows it can get everything it needs from an index, it doesn't have to bounce back to the table. In this case the query planner can random-access the index to the beginning of your date range, then do an index range scan to the end of the range. It's still going to have to refer to the other table, though.

(Edit) For the sake of experimentation, try a narrower date range. See if EXPLAIN changes. In that case, the query planner might be guessing your date_added column's cardinality wrong.

You might try an index hint. For example, try

SELECT COUNT(1) AS clicks
  FROM clicks AS c USE INDEX (date_added)
  LEFT JOIN links AS l ON l.id = c.link_id
 WHERE etc

But, judging from your EXPLAIN output, you're already doing a range scan on date_added. Your next step, like it or not, is the compound covering index.

Make sure there's an index on links(id). There probably is, because it's probably the PK.

Try using COUNT(*) instead of COUNT(1). It probably won't make a difference, but it's worth a try. COUNT(*) simply counts rows rather than evaluating something for each row it counts.

(Nitpick) Your date range smells funny. Use < for the end of your range for best results, like so.

 WHERE c.date_added >= '2016-11-01'
   AND c.date_added <  '2016-11-17';

Edit: Look, the MySQL query planner uses lots of internal knowledge about how tables are structured. And, it can only use one index per table to satisfy a query as of late 2016. That's a limitation.

SELECT DISTINCT column is actually a fairly complex query, because it has to de-dupe the column in question. If there's an index on that column, the query planner is likely to use it. Choosing that index means it could not choose some other index.

Compound indexes (covering indexes) sometimes but not always resolve this kind of index-selection dilemma, and allow index dual usage. You can read about all this at http://use-the-index-luke.com/

But if your operational constraints prevent the adding of compound indexes, you'll need to live with the one-second query. It isn't that bad.

Of course, saying you can't add compound indexes to get your job done is like this:

A: stuff is falling off my truck on the freeway.

B: put a tarp over the stuff and tie it down.

A: my boss won't let me put a tarp on the truck.

B: well, then, drive slow.

Upvotes: 2

Rahul

Reputation: 77926

Not absolutely sure but consider moving the condition from WHERE condition to JOIN ON condition since you are performing a outer join (LEFT JOIN) it makes difference in performance unlike inner join where the condition be it on where or join on clause is equivalent.

SELECT COUNT(1) AS clicks 
FROM clicks AS c 
LEFT JOIN links AS l ON l.id = c.link_id 
AND (c.date_added >= '2016-11-01 00:00:00' 
AND c.date_added <= '2016-11-16 23:59:59');

Upvotes: 1

MySQL query with JOIN not using INDEX

Answers (2)

Related Questions