Reputation: 17928
Reading this wiki article, I found out that the SELECT performance is killed if using IN() clauses with indexed columns in a MySQL database. My question is, how can I rewrite my query so that it won't use any IN() clause while still keeping its functionality?
My query is:
SELECT
`Route`.`route_id`, `Route`.`order`, `Route2`.`order`
FROM
`routes` AS `Route`
INNER JOIN
`routes` AS `Route2`
ON `Route`.`route_id` = `Route2`.`route_id`
WHERE
`Route`.`station_line_id` IN ([10 values]) AND
`Route2`.`station_line_id` IN ([10 values]) AND
`Route`.`order` <= `Route2`.`order`
GROUP BY `
`Route`.`station_line_id`, `Route2`.`station_line_id`, (`Route2`.`order` - `Route`.`order`)
and I have indexed all columns (route_id, station_line_id, station_id and line_id), with the id column being the primary key (the table is just read-only once generated, so no worries for indexing everything). The [10 values]
in the IN() clause are comma separated, like: IN(1, 2, ..., 10)
.
Basically, I self join the table routes table and group the results to get the desired records. The other joins are used for retrieving associated data.
Performance-wise, using InnoDB storage engine, I execute a similar query in >30seconds. Using MyISAM, I get >5seconds. But I believe results can be fetched even faster. I have ~4.5 million records in the table.
Upvotes: 2
Views: 2274
Reputation: 17535
You'll get the best performance in a query like this using a 'Hash index'. The 'standard' index is a B+ tree, which allows you to lookup entries in log(n) time where n is the number of rows in the table. They also maintain sorted order, so you can efficiently do queries like ... WHERE station_line_id > 14
, so that's what you'll want to use on your Order
column.
In your case, however, with an IN
clause, you're only looking for equivalence. In that case, a B+ tree is going to have to lookup all m of your "[10 values]" separately, costing you m * log(n) time, which is apparently taking 5-30 seconds.
A hash index is used to lookup equivalent entries in a constant amount of time (very fast) which doesn't depend (theoretically) on the number of rows in your table -- it will always be very fast even on large tables. The downside of a hash index is that you can't use it to do queries like <
or >
, but it's the fastest at equivalence queries like the ones you're doing in your IN
clause in station_line_id
.
Edit: For MySQL specifically, they unfortunately don't support HASH indexes on any of their popular database engines. If you're able to use the MEMORY or HEAP engine, then you could use a HASH index -- and having everything in memory will likely improve performance quite a bit anyways. Worth a shot.
Upvotes: 1