Deepu

Reputation: 2616

Creating Index after Partitioning the MySQL Table?

I have a MySQL table, Stop_Times, with 20 million records.

I am using the MyISAM storage engine for this table.

My table structure is:

+---------------------+-------------+------+-----+---------+-------+
| Field               | Type        | Null | Key | Default | Extra |
+---------------------+-------------+------+-----+---------+-------+
| agency_id           | varchar(50) | NO   |     | NULL    |       |
| trip_id             | varchar(50) | NO   | PRI | NULL    |       |
| arrival_time        | time        | NO   |     | NULL    |       |
| departure_time      | time        | NO   |     | NULL    |       |
| stop_id             | varchar(50) | NO   | PRI | NULL    |       |
| stop_sequence       | int(11)     | NO   | PRI | NULL    |       |
| route_id            | varchar(50) | NO   |     | NULL    |       |
| route_type          | int(5)      | NO   |     | NULL    |       |
+---------------------+-------------+------+-----+---------+-------+

The output of SHOW CREATE TABLE Stop_Times is:

CREATE TABLE `Stop_Times` (
 `agency_id` varchar(50) NOT NULL,
 `trip_id` varchar(50) NOT NULL,
 `arrival_time` time NOT NULL,
 `departure_time` time NOT NULL,
 `stop_id` varchar(50) NOT NULL,
 `stop_sequence` int(11) NOT NULL,
 `route_id` varchar(50) NOT NULL,
 `route_type` int(5) NOT NULL,
 UNIQUE KEY `idx_Stop_Times` (`agency_id`,`trip_id`,`stop_sequence`,`stop_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (stop_id)
PARTITIONS 250 */

I have one query:

select distinct trip_id, stop_sequence from Stop_Times where agency_id = ? and stop_id = ?

Before partitioning the table, this query took 4-5 minutes or more.

I have since partitioned the table on stop_id, using PARTITION BY KEY (stop_id).

Now this query takes 2-3 seconds to execute.

I have another query that runs after the one above:

select distinct(stop_id) from Stop_Times where agency_id = ? and trip_id = ? and stop_sequence > ? 

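The output of EXPLAIN on the above query is:

+----+-------------+------------+-------+----------------+----------------+---------+------+------+-------------------------------------------+
| id | select_type | table      | type  | possible_keys  | key            | key_len | ref  | rows | Extra                                     |
+----+-------------+------------+-------+----------------+----------------+---------+------+------+-------------------------------------------+
| 1  | SIMPLE      | Stop_Times | range | idx_Stop_Times | idx_Stop_Times | 308     | NULL | 250  | Using where; Using index; Using temporary |
+----+-------------+------------+-------+----------------+----------------+---------+------+------+-------------------------------------------+

This query takes much longer: 90-150 seconds.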

So my question is: do I need to create an index on trip_id and stop_sequence? Will that speed up the query?

Do I need to change the storage engine from MyISAM to InnoDB, since there are many reads (SELECT queries) on this table from multiple users at the same time?

Please Database gurus help me.

Thank You

Upvotes: 1

Views: 2831

Answers (2)

Justin Swanhart

Reputation: 1856

Perhaps you want to consider sharding this data set.

I maintain a tool called Shard-Query which can query all of the shards in parallel. You have a natural sharding key (stop_id) which you are currently using for KEY partitioning. With Shard-Query you can use this same column with HASH partitioning, so you get functional equivalency.

With Shard-Query you would create 250 databases, each with an identical copy of the table. This is equivalent to your 250 partitions.

Partition elimination
When you execute the first query, Shard-Query will send the query only to the partition which contains the given stop_id. This is the same as MySQL partition pruning.
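Pruning on the MySQL side can be checked without Shard-Query. A minimal sketch, assuming a MySQL version that still accepts EXPLAIN PARTITIONS (5.1 through 5.7; from 5.7 on, plain EXPLAIN shows the partitions column by default), and using made-up literal values in place of the ? placeholders:

```sql
-- Hypothetical literals ('AGENCY_1', 'STOP_42', 'TRIP_7', 3) stand in for the ? parameters.

-- Query #1 filters on the partitioning column (stop_id),
-- so the partitions column should name a single partition.
EXPLAIN PARTITIONS
SELECT DISTINCT trip_id, stop_sequence
FROM Stop_Times
WHERE agency_id = 'AGENCY_1' AND stop_id = 'STOP_42';

-- Query #2 has no predicate on stop_id, so no partition can be ruled out
-- and the partitions column should list all 250.
EXPLAIN PARTITIONS
SELECT DISTINCT stop_id
FROM Stop_Times
WHERE agency_id = 'AGENCY_1' AND trip_id = 'TRIP_7' AND stop_sequence > 3;
```

Comparing the two partitions columns shows why the first query benefits from the partitioning and the second does not.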

Massively Parallel Processing (MPP)
For the second query, Shard-Query will scan the partitions in parallel, based on the number of Gearman (message queue) workers that you run. If you have a sixteen-core machine, you can get 16 degrees of parallelism over the partitions instead of MySQL's single-threaded scan of all the partitions.

You will see massive speed increases with the parallel scan as long as your server has the resources to handle the parallelism. If not, you can split the data across N servers (this is the "massive" in MPP) and you'll get linear scaling as you add nodes. Keep in mind that adding or removing shards requires reloading all data if you use HASH partitioning, so this should be infrequent.

One caveat:
Shard-Query supports COUNT(DISTINCT) but not SELECT DISTINCT ...
You can simply rewrite the queries to use GROUP BY:

Query #1

select trip_id, stop_sequence from Stop_Times where agency_id = ? and stop_id = ? group by trip_id, stop_sequence;

Query #2

select stop_id from Stop_Times where agency_id = ? and trip_id = ? and stop_sequence > ? group by stop_id;

Upvotes: 1

Mark Byers

Reputation: 838416

Create a covering index on (agency_id, trip_id, stop_sequence, stop_id). Note that the order of the columns in the index is important. Using a different order may be less efficient.
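As a sketch of the DDL this implies (the index name idx_agency_trip_seq_stop is made up for illustration). The SHOW CREATE TABLE output in the question already lists a unique key, idx_Stop_Times, on these same columns in this order, so it is worth running SHOW INDEX first rather than adding a duplicate:

```sql
-- List existing indexes; Seq_in_index gives each column's position within its index.
SHOW INDEX FROM Stop_Times;

-- Only if no index with these columns in this order exists yet.
-- idx_agency_trip_seq_stop is a hypothetical name.
ALTER TABLE Stop_Times
  ADD INDEX idx_agency_trip_seq_stop (agency_id, trip_id, stop_sequence, stop_id);
```

The order follows the second query: the equality columns (agency_id, trip_id) come first, then the range column (stop_sequence), and stop_id comes last so the query can be answered from the index alone.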

Upvotes: 1
