Reputation: 41
We have a MySQL table that looks something like this (insignificant columns removed):
CREATE TABLE `my_data` (
`auto_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`created_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`data_txt` varchar(256) CHARACTER SET utf8 NOT NULL,
`issued_ts` timestamp NULL DEFAULT NULL,
`account_id` int(11) NOT NULL,
PRIMARY KEY (`auto_id`),
KEY `account_issued_idx` (`account_id`,`issued_ts`),
KEY `account_issued_created_idx` (`account_id`,`issued_ts`,`created_ts`),
KEY `account_created_idx` (`account_id`,`created_ts`),
KEY `issued_idx` (`issued_ts`)
) ENGINE=InnoDB;
We have approximately 900M rows in the table, with one account_id accounting for more than 65% of those rows. I'm being asked to write queries across date ranges for both created_ts and issued_ts that depend upon the account_id, which appears to have a 1:1 functional dependence on the auto increment key.
A typical query would look like this:
SELECT *
FROM my_data
WHERE account_id = 1 AND
created_ts > TIMESTAMP('2012-01-01') AND
created_ts <= TIMESTAMP('2012-01-21')
ORDER BY created_ts DESC LIMIT 100;
An EXPLAIN on the query shows this:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: my_data
type: range
possible_keys: account_issued_idx, account_issued_created_idx, account_created_idx,
key: account_issued_created_idx
key_len: 8
ref: NULL
rows: 365314721
Extra: Using where
The problem is that the query takes far too long and is eventually killed. I've let it run a couple of times and it brings the down the database host because the OS (Linux) runs out of swap space.
I've researched the issue, repeatedly, and have tried to break up the query into uncorrelated subqueries, forcing indexes, using an explicit SELECT clause, and limiting the window of the date range, but the result is the same: poor performance (too slow) and too taxing on the host (invariably dies).
My question(s) are:
Is it possible that a query can be formulated to slice the data into date ranges and perform acceptably for a real-time call? ( < 1s)
Are there optimizations that I'm missing, or may help, in order to get the performance I am being asked to get?
Any other suggestions, hints, or thoughts are welcomed.
Thanks
Upvotes: 4
Views: 1894
Reputation: 108841
This question is getting on in years. Still, there's a good answer.
The key to your struggle lies in your words insignificant columns removed. There aren't any insignificant columns when you do SELECT * .... ORDER BY X DESC LIMIT N
. That's because the entire resultset has to be picked up and shuffled. When you ask for all the columns in a complex table, that's a lot of data.
You have a good index for the WHERE
clause. It would also be good for the ORDER BY
clause if that didn't say DESC
in it.
What you want is a deferred join. Start by retrieving just the IDs of the rows you need.
SELECT auto_id
FROM my_data
WHERE account_id = 1 AND
created_ts > TIMESTAMP('2012-01-01') AND
created_ts <= TIMESTAMP('2012-01-21')
ORDER BY created_ts DESC
LIMIT 100
This will give you the list of auto_id
values for the columns you need. To order this list, MySql only has to shuffle the id and timestamp values. It's LOTS less data to handle.
Then you JOIN
that list of IDs to your main table and grab the results.
SELECT a.*
FROM my_data a
JOIN (
SELECT auto_id
FROM my_data
WHERE account_id = 1 AND
created_ts > TIMESTAMP('2012-01-01') AND
created_ts <= TIMESTAMP('2012-01-21')
ORDER BY created_ts DESC
LIMIT 100
) b ON a.auto_id = b.auto_id
ORDER BY a.created_ts DESC
Try this. It will probably save you a lot of time.
If you know a priori that both auto_id and created_ts are monotone increasing, then you can do even better. Your subquery can contain
ORDER BY auto_id DESC
LIMIT 100
That will reduce the data you need to shuffle even further.
Pro tip: avoid SELECT *
in production systems; instead enumerate the columns you actually need. There are lots of reasons for this.
Upvotes: 1
Reputation: 115630
Not sure why MySQL uses the (obviously) not best index. Besides forcing the index, can you try the EXPLAIN
plan on this variation:
SELECT *
FROM my_data
WHERE account_id = 1 AND
created_ts > TIMESTAMP('2012-01-01') AND
created_ts <= TIMESTAMP('2012-01-21')
ORDER BY account_id
, created_ts DESC
LIMIT 100;
Upvotes: 0
Reputation: 26739
Do not use function in the comparision. Calculate the timestamps and use the computed values, otherwise you can't use the index to compare created_ts, and it's the field that will filter million of rows from the resultset
Upvotes: 0
Reputation: 1905
Try MariaDB (or MySQL 5.6), as their Optimizer can do it faster. I am using it for some months, and for some queries like yours it's 1000% faster.
You need Index Condition Pushdown: http://kb.askmonty.org/en/index-condition-pushdown/
Upvotes: 0
Reputation: 4072
Seems mysql uses wrong index for this query, try to force another:
SELECT *
FROM my_data FORCE INDEX (`account_created_idx`)
WHERE account_id = 1 AND
created_ts > TIMESTAMP('2012-01-01') AND
created_ts <= TIMESTAMP('2012-01-21')
ORDER BY created_ts DESC LIMIT 100;
Upvotes: 4