Reputation: 490
I know this question has been asked a few times but I couldn't make my query fast enough by reading those answers.
Basically I have a table here with 400k rows. It used to have more than 1.8m rows bu the query time is over 17 secs so I have a cron job to cut off the records older than 5 days in that table to keep the records around 400k rows so the query time is just over 5 secs and 5 secs is still slow. We have another a few tables that involve more than 2m records and use JOIN so I prefer to solve this trend table first to gain more exp then touch the others to increase the query performance in more complicated cases.
Data structure:
| _id | doctype | subtype | term | user_id | nug_id | source | timestamp | confidence |
|-----|---------|---------|------|---------|---------|--------|-----------|------------|
| 123 | post | keyword | games| 1000 | 200 | twitter| 143389203 | 0.0123 |
I have indexed term
, timestamp
, source
, confidence
.
Normally my query is:
SELECT term, SUM(confidence) AS relevance FROM trends
WHERE source IN ("twitter", "tumblr", "instagram", "post", "flickr")
GROUP BY term ORDER BY relevance DESC
And here is my result:
Showing rows 0 - 29 (165032 total, Query took 5.8050 sec)
So what should I do next to optimize the index or query to increase the performance. I can now foresee that how bad my query time is gonna be when I query with JOIN.
Add1: Sorry I forgot to attach the EXPLAIN output.
Add2: Table structure
CREATE TABLE `trends` (
`_id` bigint(20) NOT NULL AUTO_INCREMENT,
`doctype` varchar(10) DEFAULT NULL,
`subtype` varchar(20) DEFAULT NULL,
`term` varchar(200) DEFAULT NULL,
`user_id` varchar(100) DEFAULT NULL,
`nug_id` varchar(100) DEFAULT NULL,
`timestamp` bigint(20) DEFAULT NULL,
`source` varchar(100) DEFAULT NULL,
`confidence` float DEFAULT NULL,
PRIMARY KEY (`_id`),
KEY `confidence` (`confidence`),
KEY `give_me_trends` (`user_id`,`source`),
KEY `term` (`term`,`source`),
KEY `timestamp` (`timestamp`,`confidence`),
KEY `source` (`source`)
) ENGINE=InnoDB AUTO_INCREMENT=95350350 DEFAULT CHARSET=utf8
Add3:
After created a new table called test_trends
and copied the data from the trends
table, I tested with source
column as integer. Also I removed two columns doctype
and subtype
as they are not needed at all. Queried as below:
SELECT term, SUM(confidence) AS relevance FROM test_trends
WHERE source IN (1,2,3,4,5,6,7)
GROUP BY term ORDER BY relevance DESC
in 5.4802 sec.
EXPLAIN as below:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|-----|-------------|-------------|--------|-------------------|---------|-----------|--------|--------|----------------------------------------------|
| 1 | SIMPLE | test_trends | index | source,source_2 | term_2 | 603 | NULL | 354324 | Using where; Using temporary; Using filesort |
Add4:
My test table structure:
CREATE TABLE `test_trends` (
`_id` bigint(20) NOT NULL AUTO_INCREMENT,
`term` varchar(200) DEFAULT NULL,
`user_id` varchar(100) DEFAULT NULL,
`nug_id` varchar(100) DEFAULT NULL,
`timestamp` bigint(20) DEFAULT NULL,
`source` tinyint(1) DEFAULT NULL,
`confidence` float DEFAULT NULL,
PRIMARY KEY (`_id`),
KEY `confidence` (`confidence`),
KEY `give_me_trends` (`user_id`,`source`),
KEY `term` (`term`,`source`),
KEY `timestamp` (`timestamp`,`confidence`),
KEY `source` (`source`),
KEY `term_2` (`term`),
KEY `source_2` (`source`,`confidence`,`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=95354268 DEFAULT CHARSET=utf8
Also I indexed term
, source
, confidence
, timestamp
.
Add5:
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 0 | PRIMARY | 1 | _id | A | 379365 | NULL | NULL | | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | confidence | 1 | confidence | A | 18 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | give_me_trends | 1 | user_id | A | 149 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | give_me_trends | 2 | source | A | 556 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | term | 1 | term | A | 379365 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | term | 2 | source | A | 379365 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | timestamp | 1 | timestamp | A | 13548 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | timestamp | 2 | confidence | A | 189682 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | source | 1 | source | A | 107 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | term_2 | 1 | term | A | 379365 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | source_2 | 1 | source | A | 18 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | source_2 | 2 | confidence | A | 189 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1 | source_2 | 3 | timestamp | A | 189682 | NULL | NULL | YES | BTREE | | |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Upvotes: 4
Views: 120
Reputation: 142528
INDEX(source, term, confidence)
-- Start with source
for the filtering (WHERE
), continue with the rest of the columns that are used in the query. "Covering" means that the query will complete in the index without reaching over to the data. Having the columns in this order may eliminate the temp table and sort for GROUP BY
(but not for the ORDER BY
).EXPLAIN
). This will be especially useful if the index is too big to be cached in the buffer_pool.(200)
if practical.INDEX(a)
is not needed if you have INDEX(a,b)
.PRIMARY KEY
.ORDER BY
being different than the GROUP BY
. There is probably no significant advantage in moving the ORDER BY
in application code.Upvotes: 1
Reputation: 26804
Besides what the other answers already mentioned one thing that jumps to me is that you are searching for ints on a varchar column,making the column not sargable(can`t use an index).
From that search I would guess you are only storing numbers in source so make it an INT column if that`s the case.
Upvotes: 1
Reputation: 562891
It's going to be quite hard to optimize this query. You have two things in the query that an index can help:
IN()
) can be helped by an index on the source
column, but the optimizer will not choose this index if the matching rows are more than approximately 20% of the rows in the table.GROUP BY
can be helped by an index on the term
column, to make the query read the table in the order of values in that column.But you can use an index to help one or the other of these features of the query, but not both.
You're doing a complete index-scan of your term_2
index, which is almost as costly as a table-scan. You can see from the EXPLAIN that it's visiting over 354,000 leaves of that index.
You're also getting Using temporary; using filesort
I would define all columns as NOT NULL
if they aren't supposed to be nullable. As I recall, this helps to avoid the Using where
note.
You should define a covering index, to make sure the query doesn't need to read any data outside of the index structure itself. Create an index on the columns (term, source, confidence)
. Make sure the term
column is first in that index, the order of the other two columns doesn't matter.
Make sure you increase the innodb_buffer_pool_size
to hold the index in memory.
Upvotes: 3
Reputation: 1234
Try removing order by
and do your sorting in your application logic.
Hope it can minimize your query load.
Upvotes: 1