Reputation: 105
I'm trying to optimise a left join involving two tables, but I can't get my head round possible indexes to speed up things. Table 1 contains 2171289 rows:
text_metadata_for_nzcorpus | CREATE TABLE `text_metadata_for_nzcorpus` (
`text_id` varchar(255) NOT NULL,
`newspaper` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`year` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`month` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`day` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`section` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`subsection` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`topics` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`words` int(11) NOT NULL DEFAULT '0',
`cqp_begin` bigint(20) unsigned NOT NULL DEFAULT '0',
`cqp_end` bigint(20) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`text_id`),
KEY `newspaper` (`newspaper`),
KEY `year` (`year`),
KEY `month` (`month`),
KEY `day` (`day`),
KEY `section` (`section`),
KEY `subsection` (`subsection`),
KEY `topics` (`topics`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
The second table only contains 8584 rows:
db_dist_fb8ddyk760 | CREATE TABLE `db_dist_fb8ddyk760` (
`text_id` varchar(255) COLLATE utf8_bin DEFAULT NULL,
`beginPosition` int(11) DEFAULT NULL,
`endPosition` int(11) DEFAULT NULL,
`refnumber` mediumint(9) NOT NULL AUTO_INCREMENT,
KEY `refnumber` (`refnumber`),
KEY `text_id` (`text_id`)
) ENGINE=InnoDB AUTO_INCREMENT=16384 DEFAULT CHARSET=utf8 COLLATE=utf8_bin |
I need to run queries of the following type:
SELECT md.day as handle, count(db.text_id) as hits,
count(distinct db.text_id) as files FROM text_metadata_for_nzcorpus as md
LEFT JOIN db_dist_fb8ddyk760 as db on md.text_id = db.text_id
GROUP BY md.day;
This currently takes more than 5 seconds to process. Since it is one of quite a few queries I need to run before displaying the output on a webpage, I'd like to speed things up, if at all possible. Here's the output of "explain":
+----+-------------+-------+-------+---------------+---------+---------+----------------------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+----------------------+---------+--------------------------+
| 1 | SIMPLE | md | index | day | day | 768 | NULL | 2452080 | Using index |
| 1 | SIMPLE | db | ref | text_id | text_id | 768 | cqpweb_db.md.text_id | 1 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+----------------------+---------+--------------------------+
Any helpful suggestions would be greatly appreciated. (I am not the developer of the system and I'm not in charge of the code as such - but I'd like to provide input to the programmer if things can be improved...)
Many thanks! Sebastian
Upvotes: 1
Views: 229
Reputation: 562310
Your EXPLAIN report shows that you're already using an index for both tables, and you're not using a temp table for the GROUP BY, and both tables are using covering indexes ("Using index").
Some other things you could do besides creating indexes:
NOT NULL
. This might get rid of the "Using where" note, which means it has to evaluate an expression as part of the search. That could be slightly more efficient.innodb_buffer_pool_size
enough so that the indexes can be cached in memory. You can get better performance and less disk I/O if your query reads the index pages from the buffer pool only.Re your comment:
BTW, the table db_dist_fb8ddyk760 is likely to be used only once or twice and then discarded.
Then why are you storing it in the durable database at all?
Consider using an in-memory key/value store like Redis. Make each key corresponds to a day, and each value is a structure containing the number of hits and the set of distinct text_id's. This is basically making a summary table (which you could do in SQL too), but Redis is in-memory.
Upvotes: 1
Reputation: 142298
Don't use VARCHAR(255)
blindly. Use datatypes that make sense for the data. Many of those columns sound like numbers, not strings.
Assuming that year+month+day are simply parts of a DATE
, use a single column with datatype DATE
. Then, extract the date using DAY(date_col)
.
Every InnoDB table should have a PRIMARY KEY
. Perhaps the combo of (text_id, beginPosition)
is unique and could be the PK?
Every column an be NULL
?? I doubt it. Make them NOT NULL
except where you have a reason for NULL
.
refnumber
is AUTO_INCREMENT
, but not the PRIMARY KEY
?? What gives?
Making the above changes will help some. But the query as stated is destined to scan the entire 2M-row table and reach into the other table a lot. Things can be done. But they will involve building and maintaining a Summary table.
Upvotes: 2