Sebastian Hoffmann

Reputation: 105

MySQL left join with group by - optimisation of indexes

I'm trying to optimise a LEFT JOIN involving two tables, but I can't get my head round possible indexes to speed things up. Table 1 contains 2,171,289 rows:

CREATE TABLE `text_metadata_for_nzcorpus` (
    `text_id` varchar(255) NOT NULL,
    `newspaper` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `year` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `month` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `day` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `section` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `subsection` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `topics` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
    `words` int(11) NOT NULL DEFAULT '0',
    `cqp_begin` bigint(20) unsigned NOT NULL DEFAULT '0',
    `cqp_end` bigint(20) unsigned NOT NULL DEFAULT '0',
    PRIMARY KEY (`text_id`),
    KEY `newspaper` (`newspaper`),
    KEY `year` (`year`),
    KEY `month` (`month`),
    KEY `day` (`day`),
    KEY `section` (`section`),
    KEY `subsection` (`subsection`),
    KEY `topics` (`topics`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

The second table only contains 8584 rows:

CREATE TABLE `db_dist_fb8ddyk760` (
    `text_id` varchar(255) COLLATE utf8_bin DEFAULT NULL,
    `beginPosition` int(11) DEFAULT NULL,
    `endPosition` int(11) DEFAULT NULL,
    `refnumber` mediumint(9) NOT NULL AUTO_INCREMENT,
    KEY `refnumber` (`refnumber`),
    KEY `text_id` (`text_id`)
) ENGINE=InnoDB AUTO_INCREMENT=16384 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

I need to run queries of the following type:

SELECT md.day AS handle,
       COUNT(db.text_id) AS hits,
       COUNT(DISTINCT db.text_id) AS files
FROM text_metadata_for_nzcorpus AS md
LEFT JOIN db_dist_fb8ddyk760 AS db ON md.text_id = db.text_id
GROUP BY md.day;

This currently takes more than 5 seconds to process. Since it is one of quite a few queries I need to run before displaying the output on a webpage, I'd like to speed things up, if at all possible. Here's the output of EXPLAIN:

+----+-------------+-------+-------+---------------+---------+---------+----------------------+---------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref                  | rows    | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+----------------------+---------+--------------------------+
|  1 | SIMPLE      | md    | index | day           | day     | 768     | NULL                 | 2452080 | Using index              |
|  1 | SIMPLE      | db    | ref   | text_id       | text_id | 768     | cqpweb_db.md.text_id |       1 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+----------------------+---------+--------------------------+

Any helpful suggestions would be greatly appreciated. (I am not the developer of the system and not in charge of the code as such, but I'd like to provide input to the programmer if things can be improved.)

Many thanks! Sebastian

Upvotes: 1

Views: 229

Answers (2)

Bill Karwin

Reputation: 562310

Your EXPLAIN report shows that you're already using an index for both tables, that the GROUP BY isn't creating a temp table, and that both tables are read from covering indexes ("Using index").

Some other things you could do besides creating indexes:

  • Define db_dist_fb8ddyk760.text_id as NOT NULL. This might get rid of the "Using where" note, which means it has to evaluate an expression as part of the search. That could be slightly more efficient.
  • Define db_dist_fb8ddyk760.text_id as the PRIMARY KEY of that table, if that makes sense, in other words if text_id is unique in that table. That way the "type: ref" will become "type: eq_ref", meaning a unique key lookup, which is a little more efficient. But of course disregard this suggestion if this table needs to record multiple hits per text_id. (Both of the first two changes are sketched just after this list.)
  • Increase your innodb_buffer_pool_size enough so that the indexes can be cached in memory. You can get better performance and less disk I/O if your query reads the index pages from the buffer pool only.
  • Take advantage of the MySQL Query Cache so if you run the same query again, it'll reuse the result from the previous query. But the query cache may be of little use if the data in these tables changes more frequently than you execute the query.
  • Consider caching the result in application memory or memcached or something.
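
To make the first two of those concrete (with a note on the third), here's a sketch of the DDL. It assumes text_id really does turn out to be unique in db_dist_fb8ddyk760; the first statement checks that before anything is changed:

    -- Look for duplicate text_id values before promoting it to PRIMARY KEY:
    SELECT text_id, COUNT(*) AS c
    FROM db_dist_fb8ddyk760
    GROUP BY text_id
    HAVING c > 1;

    -- If that returns no rows, both changes fit in one table rebuild:
    ALTER TABLE db_dist_fb8ddyk760
        MODIFY text_id varchar(255) COLLATE utf8_bin NOT NULL,
        DROP KEY text_id,
        ADD PRIMARY KEY (text_id);

    -- Buffer pool sizing is a server setting (my.cnf); the 2G below is only
    -- a placeholder, so size it to your indexes and available RAM:
    -- [mysqld]
    -- innodb_buffer_pool_size = 2G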

Re your comment:

BTW, the table db_dist_fb8ddyk760 is likely to be used only once or twice and then discarded.

Then why are you storing it in the durable database at all?

Consider using an in-memory key/value store like Redis. Make each key correspond to a day, and each value a structure containing the number of hits and the set of distinct text_ids. This is basically a summary table (which you could do in SQL too), but Redis keeps it in memory.
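
For the SQL flavour of that idea, a minimal sketch (day_summary is a hypothetical name; a UNIQUE key is used instead of a PRIMARY KEY only because md.day is currently nullable):

    CREATE TABLE day_summary (
        handle varchar(255) CHARACTER SET utf8 COLLATE utf8_bin,
        hits   int unsigned NOT NULL,
        files  int unsigned NOT NULL,
        UNIQUE KEY (handle)
    ) ENGINE=InnoDB;

    -- Pay the 5-second aggregation once, up front:
    INSERT INTO day_summary (handle, hits, files)
    SELECT md.day, COUNT(db.text_id), COUNT(DISTINCT db.text_id)
    FROM text_metadata_for_nzcorpus AS md
    LEFT JOIN db_dist_fb8ddyk760 AS db ON md.text_id = db.text_id
    GROUP BY md.day;

The webpage then runs SELECT handle, hits, files FROM day_summary; which touches a handful of rows instead of scanning 2M.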

Upvotes: 1

Rick James

Reputation: 142298

Don't use VARCHAR(255) blindly. Use datatypes that make sense for the data. Many of those columns sound like numbers, not strings.

Assuming that year+month+day are simply parts of a DATE, use a single column with datatype DATE. Then extract the day of the month using DAY(date_col).
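
A sketch of that conversion, assuming the three columns always hold sensible numeric strings (pub_date is a hypothetical name; spot-check the backfilled values before dropping anything):

    ALTER TABLE text_metadata_for_nzcorpus ADD COLUMN pub_date DATE;

    -- Rows where any part is NULL get a NULL pub_date; malformed strings
    -- may error or go NULL depending on sql_mode:
    UPDATE text_metadata_for_nzcorpus
    SET pub_date = STR_TO_DATE(CONCAT(year, '-', month, '-', day), '%Y-%m-%d');

    -- Once verified, retire the string columns; dropping each column also
    -- drops its single-column index:
    ALTER TABLE text_metadata_for_nzcorpus
        DROP COLUMN year, DROP COLUMN month, DROP COLUMN day,
        ADD KEY (pub_date);

Note that GROUP BY DAY(pub_date) keeps the original query's semantics of pooling the same day-of-month across all months and years.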

Every InnoDB table should have a PRIMARY KEY. Perhaps the combo of (text_id, beginPosition) is unique and could be the PK?

Every column can be NULL?? I doubt it. Make them NOT NULL except where you have a reason for NULL.

refnumber is AUTO_INCREMENT, but not the PRIMARY KEY?? What gives?
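
Putting those last three points together, the repair might look like the sketch below. It assumes (text_id, beginPosition) is genuinely unique and that none of those columns currently holds NULL, so check both before running it:

    ALTER TABLE db_dist_fb8ddyk760
        MODIFY text_id       varchar(255) COLLATE utf8_bin NOT NULL,
        MODIFY beginPosition int NOT NULL,
        MODIFY endPosition   int NOT NULL,
        ADD PRIMARY KEY (text_id, beginPosition);

refnumber can keep its AUTO_INCREMENT: MySQL only requires that an auto-increment column lead some index, and the existing KEY `refnumber` already satisfies that.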

Making the above changes will help some. But the query as stated is destined to scan the entire 2M-row table and reach into the other table a lot. Things can be done. But they will involve building and maintaining a Summary table.
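
A minimal version of such a summary table and its refresh (names are hypothetical, and the refresh is the maintenance cost to plan for; it must re-run whenever either base table changes):

    CREATE TABLE summary_hits_per_day (
        day   varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
        hits  int unsigned NOT NULL,
        files int unsigned NOT NULL,
        PRIMARY KEY (day)
    ) ENGINE=InnoDB;

    -- Rows with a NULL day are skipped, since the PRIMARY KEY can't hold NULL:
    INSERT INTO summary_hits_per_day (day, hits, files)
    SELECT md.day, COUNT(db.text_id), COUNT(DISTINCT db.text_id)
    FROM text_metadata_for_nzcorpus AS md
    LEFT JOIN db_dist_fb8ddyk760 AS db ON md.text_id = db.text_id
    WHERE md.day IS NOT NULL
    GROUP BY md.day
    ON DUPLICATE KEY UPDATE hits = VALUES(hits), files = VALUES(files);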

Upvotes: 2
