Reputation: 694
I recently upgraded to MyQSL 8 from 5.7 (by spinning up a snapshot of my production RDS instance with the updated engine, if that makes any difference) and I'm having an issue on a ~64M rows table (raw_records
), when querying for the latest record available for each weather station.
| station_id | tstamp | temp | rel_hum |
|------------|---------------------|----------|----------|
| 1 | 2022-12-06 07:00:00 | 23.40000 | 34.00000 |
| 1 | 2022-12-06 08:00:00 | 23.70000 | 34.00000 |
| 2 | 2022-12-06 07:00:00 | 20.20000 | 43.00000 |
| 2 | 2022-12-06 08:00:00 | 19.80000 | 38.00000 |
...
The query (#1 below) used to work fine, taking few hundreds of ms to execute for up about 350 station IDs. After the database update however the query took too long so the request failed due to timeout. It turns out that the running time is now somehow proportional to the number of station IDs passed: with just 10 IDs the original query takes now ~1.5sec, with 100 stations about 20sec and so on. After some digging and experimenting it turns out the raw query below (#2) works just fine, executing in about 150-200ms for any number of stations.
Below you can see the execution time compared when running the two queries one after the other (using Laravel v9.x):
$station_ids = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ];
DB::enableQueryLog();
// original query (#1)
RawRecord::selectRaw('`station_id`, MAX(`tstamp`) AS `latest_record`')
->whereIn('station_id', $station_ids->toArray())
->groupBy('station_id')
->get();
// new query (#2)
DB::select("
SELECT `station_id`, MAX(`tstamp`) AS `latest_record`
FROM `raw_records`
WHERE `station_id` IN ( {$station_ids->join(',')} )
GROUP BY `station_id`
");
dd(DB::getQueryLog());
[
[
"query" => "select `station_id`, MAX(`tstamp`) AS `latest_record` from `raw_records` where `station_id` in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) group by `station_id`"
"bindings" => [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
"time" => 1524.56
],
[
"query" => """
SELECT `station_id`, MAX(`tstamp`) AS `latest_record`
FROM `raw_records`
WHERE `station_id` IN ( 1,2,3,4,5,6,7,8,9,10 )
GROUP BY `station_id`
"""
"bindings" => []
"time" => 148.63
]
]
Now, my question is why is this happening, what could be the issue here? It appears the resulting executed SQL command is the same exact one, the only difference is the first one is created via QueryBuilder
.
Edit
This is the output of explain
for both queries; I'm not familiar with this statement and how to interpret the result, however I would say there is some issue related to the PK and indexes used to perform the query. Just to clarify I do define the PK in the model class:
// RawRecord
protected $primaryKey = ['station_id', 'tstamp'];
RawRecord::selectRaw('`station_id`, MAX(`tstamp`) AS `latest_record`')
->whereIn('station_id', $station_ids->toArray())
->groupBy('station_id')
->explain();
[
"id" => 1
"select_type" => "SIMPLE"
"table" => "raw_records"
"partitions" => null
"type" => "range"
"possible_keys" => "PRIMARY,station_id,tstamp"
"key" => "PRIMARY"
"key_len" => "4"
"ref" => null
"rows" => 123280
"filtered" => 100.0
"Extra" => "Using where; Using index"
]
DB::select("
EXPLAIN SELECT `station_id`, MAX(`tstamp`) AS `latest_record`
FROM `raw_records`
WHERE `station_id` IN ( {$station_ids->join(',')} )
GROUP BY `station_id`
");
[
"id" => 1
"select_type" => "SIMPLE"
"table" => "raw_records"
"partitions" => null
"type" => "range"
"possible_keys" => "PRIMARY,station_id,tstamp"
"key" => "station_id"
"key_len" => "4"
"ref" => null
"rows" => 368
"filtered" => 100.0
"Extra" => "Using where; Using index for group-by"
]
Upvotes: 1
Views: 805
Reputation: 141
1. Checking indexes
Make sure that you have a composite index (station_id, tstamp)
.
If you have this index, MySQL will most likely pick it up.
2. If adding an index doesn't work (the index is already there)
You'll have to use USE INDEX or FORCE INDEX.
USE INDEX(station_id+tstamp)
(the composite index name).
The USE INDEX (index_list) hint tells MySQL to use only one of the named indexes to find rows in the table. The alternative syntax IGNORE INDEX (index_list) tells MySQL to not use some particular index or indexes. These hints are useful if EXPLAIN shows that MySQL is using the wrong index from the list of possible indexes.
The FORCE INDEX hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the named indexes to find rows in the table.
3. Long story
The difference between these 2 queries is that the raw one doesn't use bindings. It looks like MySQL 8 query optimiser has some opinions about bindings and fails to estimate which index is more appropriate if bindings are used for an IN
query statement. It may be reasonable if there are too many values in the IN
expression so checking the station_id
index for each of them is longer than doing a table scan. Especially if there is no composite index station_id+tstamp
so MySQL will have to do a "partial scan" for selected rows to perform GROUP BY
and calculate MAX
.
I saw similar cases probably 2 times in the last 10 years. Every time people were resisting and trying to avoid USE INDEX
. I definitely agree. We should avoid using USE INDEX
and first try to understand why MySQL is not using our index, and create a proper one. In most cases, MySQL makes better decisions. However, sometimes, very rarely - it fails.
Note: USE INDEX
is still softer than FORCE INDEX
. It may be appropriate to use FORCE INDEX
in case if the table is huge and you're sure that anything will be better than the table scan. I'd start with USE INDEX
, it's usually sufficient.
Upvotes: 1