noob
noob

Reputation: 490

How to properly increase MySQL performance by indexing

I know this question has been asked a few times but I couldn't make my query fast enough by reading those answers.

Basically I have a table here with 400k rows. It used to have more than 1.8m rows bu the query time is over 17 secs so I have a cron job to cut off the records older than 5 days in that table to keep the records around 400k rows so the query time is just over 5 secs and 5 secs is still slow. We have another a few tables that involve more than 2m records and use JOIN so I prefer to solve this trend table first to gain more exp then touch the others to increase the query performance in more complicated cases.

Data structure:

| _id | doctype | subtype | term | user_id | nug_id  | source | timestamp | confidence |
|-----|---------|---------|------|---------|---------|--------|-----------|------------|
| 123 |  post   | keyword | games| 1000    | 200     | twitter| 143389203 |  0.0123    |

I have indexed term, timestamp, source, confidence.

Normally my query is:

SELECT term, SUM(confidence) AS relevance FROM trends 
WHERE source IN ("twitter", "tumblr", "instagram", "post", "flickr")
GROUP BY term ORDER BY relevance DESC

And here is my result:

Showing rows 0 - 29 (165032 total, Query took 5.8050 sec)

So what should I do next to optimize the index or query to increase the performance. I can now foresee that how bad my query time is gonna be when I query with JOIN.

Add1: Sorry I forgot to attach the EXPLAIN output.

enter image description here

Add2: Table structure

CREATE TABLE `trends` (
 `_id` bigint(20) NOT NULL AUTO_INCREMENT,
 `doctype` varchar(10) DEFAULT NULL,
 `subtype` varchar(20) DEFAULT NULL,
 `term` varchar(200) DEFAULT NULL,
 `user_id` varchar(100) DEFAULT NULL,
 `nug_id` varchar(100) DEFAULT NULL,
 `timestamp` bigint(20) DEFAULT NULL,
 `source` varchar(100) DEFAULT NULL,
 `confidence` float DEFAULT NULL,
 PRIMARY KEY (`_id`),
 KEY `confidence` (`confidence`),
 KEY `give_me_trends` (`user_id`,`source`),
 KEY `term` (`term`,`source`),
 KEY `timestamp` (`timestamp`,`confidence`),
 KEY `source` (`source`)
) ENGINE=InnoDB AUTO_INCREMENT=95350350 DEFAULT CHARSET=utf8

Add3:

After created a new table called test_trends and copied the data from the trends table, I tested with source column as integer. Also I removed two columns doctype and subtype as they are not needed at all. Queried as below:

SELECT term, SUM(confidence) AS relevance FROM test_trends 
WHERE source IN (1,2,3,4,5,6,7) 
GROUP BY term ORDER BY relevance DESC

in 5.4802 sec.

EXPLAIN as below:

| id  | select_type |    table    |   type |   possible_keys   |   key   |  key_len  |   ref  |  rows  |                     Extra                    |
|-----|-------------|-------------|--------|-------------------|---------|-----------|--------|--------|----------------------------------------------|
|  1  |   SIMPLE    | test_trends |  index |  source,source_2  |  term_2 |    603    |  NULL  | 354324 | Using where; Using temporary; Using filesort |

Add4:

My test table structure:

CREATE TABLE `test_trends` (
 `_id` bigint(20) NOT NULL AUTO_INCREMENT,
 `term` varchar(200) DEFAULT NULL,
 `user_id` varchar(100) DEFAULT NULL,
 `nug_id` varchar(100) DEFAULT NULL,
 `timestamp` bigint(20) DEFAULT NULL,
 `source` tinyint(1) DEFAULT NULL,
 `confidence` float DEFAULT NULL,
 PRIMARY KEY (`_id`),
 KEY `confidence` (`confidence`),
 KEY `give_me_trends` (`user_id`,`source`),
 KEY `term` (`term`,`source`),
 KEY `timestamp` (`timestamp`,`confidence`),
 KEY `source` (`source`),
 KEY `term_2` (`term`),
 KEY `source_2` (`source`,`confidence`,`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=95354268 DEFAULT CHARSET=utf8

Also I indexed term, source, confidence, timestamp.

Add5:

+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 0          | PRIMARY        | 1            | _id         | A         | 379365      | NULL     | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | confidence     | 1            | confidence  | A         | 18          | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | give_me_trends | 1            | user_id     | A         | 149         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | give_me_trends | 2            | source      | A         | 556         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | term           | 1            | term        | A         | 379365      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | term           | 2            | source      | A         | 379365      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | timestamp      | 1            | timestamp   | A         | 13548       | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | timestamp      | 2            | confidence  | A         | 189682      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source         | 1            | source      | A         | 107         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | term_2         | 1            | term        | A         | 379365      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source_2       | 1            | source      | A         | 18          | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source_2       | 2            | confidence  | A         | 189         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source_2       | 3            | timestamp   | A         | 189682      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Upvotes: 4

Views: 120

Answers (4)

Rick James
Rick James

Reputation: 142528

  • (The most important suggestion.) Both queries would benefit from the "covering" INDEX(source, term, confidence) -- Start with source for the filtering (WHERE), continue with the rest of the columns that are used in the query. "Covering" means that the query will complete in the index without reaching over to the data. Having the columns in this order may eliminate the temp table and sort for GROUP BY (but not for the ORDER BY).
  • Normalize the fields (as you did with source) to shrink the data, thereby possibly speeding things up. (Note how the "key_len" is 603 in EXPLAIN). This will be especially useful if the index is too big to be cached in the buffer_pool.
  • Shorten the (200) if practical.
  • Get rid of redundant indexes - INDEX(a) is not needed if you have INDEX(a,b).
  • Is some combination of the columns "unique"? If so, we can discuss how turning that into the PRIMARY KEY.
  • Is the data "write only"? That is, do you add new rows, but never change old rows? If so, then we can talk about Summary Tables which might give you a 10x speedup. (This, if it applies, is even more important than the covering index.)
  • Using temp and using filesort is a necessity due to the ORDER BY being different than the GROUP BY. There is probably no significant advantage in moving the ORDER BY in application code.

My cookbook on indexing.

Upvotes: 1

Mihai
Mihai

Reputation: 26804

Besides what the other answers already mentioned one thing that jumps to me is that you are searching for ints on a varchar column,making the column not sargable(can`t use an index).

From that search I would guess you are only storing numbers in source so make it an INT column if that`s the case.

Upvotes: 1

Bill Karwin
Bill Karwin

Reputation: 562891

It's going to be quite hard to optimize this query. You have two things in the query that an index can help:

  • The range predicate (IN()) can be helped by an index on the source column, but the optimizer will not choose this index if the matching rows are more than approximately 20% of the rows in the table.
  • The GROUP BY can be helped by an index on the term column, to make the query read the table in the order of values in that column.

But you can use an index to help one or the other of these features of the query, but not both.

You're doing a complete index-scan of your term_2 index, which is almost as costly as a table-scan. You can see from the EXPLAIN that it's visiting over 354,000 leaves of that index.

You're also getting Using temporary; using filesort

I would define all columns as NOT NULL if they aren't supposed to be nullable. As I recall, this helps to avoid the Using where note.

You should define a covering index, to make sure the query doesn't need to read any data outside of the index structure itself. Create an index on the columns (term, source, confidence). Make sure the term column is first in that index, the order of the other two columns doesn't matter.

Make sure you increase the innodb_buffer_pool_size to hold the index in memory.

Upvotes: 3

John Roca
John Roca

Reputation: 1234

Try removing order by and do your sorting in your application logic.

Hope it can minimize your query load.

Upvotes: 1

Related Questions