Reputation: 41

MySQL inefficient query on Large data set

We have a MySQL table that looks something like this (insignificant columns removed):

CREATE TABLE `my_data` (
  `auto_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `created_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `data_txt` varchar(256) CHARACTER SET utf8 NOT NULL,
  `issued_ts` timestamp NULL DEFAULT NULL,
  `account_id` int(11) NOT NULL,
  PRIMARY KEY (`auto_id`),
  KEY `account_issued_idx` (`account_id`,`issued_ts`),
  KEY `account_issued_created_idx` (`account_id`,`issued_ts`,`created_ts`),
  KEY `account_created_idx` (`account_id`,`created_ts`),
  KEY `issued_idx` (`issued_ts`)
) ENGINE=InnoDB;

We have approximately 900M rows in the table, with one account_id accounting for more than 65% of those rows. I'm being asked to write queries across date ranges for both created_ts and issued_ts that depend upon the account_id, which appears to have a 1:1 functional dependence on the auto increment key.

A typical query would look like this:

SELECT * 
FROM my_data 
WHERE account_id = 1 AND 
      created_ts > TIMESTAMP('2012-01-01') AND 
      created_ts <= TIMESTAMP('2012-01-21') 
ORDER BY created_ts DESC LIMIT 100;

An EXPLAIN on the query shows this:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: my_data
         type: range
possible_keys: account_issued_idx, account_issued_created_idx, account_created_idx,
      key: account_issued_created_idx
  key_len: 8
      ref: NULL
     rows: 365314721
    Extra: Using where

The problem is that the query takes far too long and is eventually killed. I've let it run a couple of times and it brings the down the database host because the OS (Linux) runs out of swap space.

I've researched the issue, repeatedly, and have tried to break up the query into uncorrelated subqueries, forcing indexes, using an explicit SELECT clause, and limiting the window of the date range, but the result is the same: poor performance (too slow) and too taxing on the host (invariably dies).

My question(s) are:

Is it possible that a query can be formulated to slice the data into date ranges and perform acceptably for a real-time call? ( < 1s)
Are there optimizations that I'm missing, or may help, in order to get the performance I am being asked to get?

Any other suggestions, hints, or thoughts are welcomed.

Thanks

Upvotes: 4

Answers (5)

O. Jones

Reputation: 108841

This question is getting on in years. Still, there's a good answer.

The key to your struggle lies in your words insignificant columns removed. There aren't any insignificant columns when you do SELECT * .... ORDER BY X DESC LIMIT N. That's because the entire resultset has to be picked up and shuffled. When you ask for all the columns in a complex table, that's a lot of data.

You have a good index for the WHERE clause. It would also be good for the ORDER BY clause if that didn't say DESC in it.

What you want is a deferred join. Start by retrieving just the IDs of the rows you need.

        SELECT auto_id
          FROM my_data
         WHERE account_id = 1 AND 
              created_ts > TIMESTAMP('2012-01-01') AND 
              created_ts <= TIMESTAMP('2012-01-21') 
     ORDER BY created_ts DESC
        LIMIT 100

This will give you the list of auto_id values for the columns you need. To order this list, MySql only has to shuffle the id and timestamp values. It's LOTS less data to handle.

Then you JOIN that list of IDs to your main table and grab the results.

SELECT a.*
  FROM my_data a
  JOIN (
             SELECT auto_id
               FROM my_data
              WHERE account_id = 1 AND 
                    created_ts > TIMESTAMP('2012-01-01') AND 
                    created_ts <= TIMESTAMP('2012-01-21') 
           ORDER BY created_ts DESC
              LIMIT 100
       ) b ON a.auto_id = b.auto_id
 ORDER BY a.created_ts DESC

Try this. It will probably save you a lot of time.

If you know a priori that both auto_id and created_ts are monotone increasing, then you can do even better. Your subquery can contain

      ORDER BY auto_id DESC
         LIMIT 100

That will reduce the data you need to shuffle even further.

Pro tip: avoid SELECT * in production systems; instead enumerate the columns you actually need. There are lots of reasons for this.

Upvotes: 1

ypercubeᵀᴹ

Reputation: 115630

Not sure why MySQL uses the (obviously) not best index. Besides forcing the index, can you try the EXPLAIN plan on this variation:

SELECT * 
FROM my_data 
WHERE account_id = 1 AND 
      created_ts > TIMESTAMP('2012-01-01') AND 
      created_ts <= TIMESTAMP('2012-01-21') 
ORDER BY account_id
       , created_ts DESC 
LIMIT 100;

Upvotes: 0

Maxim Krizhanovsky

Reputation: 26739

Do not use function in the comparision. Calculate the timestamps and use the computed values, otherwise you can't use the index to compare created_ts, and it's the field that will filter million of rows from the resultset

Upvotes: 0

Moshe L

Reputation: 1905

Try MariaDB (or MySQL 5.6), as their Optimizer can do it faster. I am using it for some months, and for some queries like yours it's 1000% faster.

You need Index Condition Pushdown: http://kb.askmonty.org/en/index-condition-pushdown/

Upvotes: 0

vearutop

Reputation: 4072

Seems mysql uses wrong index for this query, try to force another:

SELECT * 
FROM my_data FORCE INDEX (`account_created_idx`)
WHERE account_id = 1 AND 
      created_ts > TIMESTAMP('2012-01-01') AND 
      created_ts <= TIMESTAMP('2012-01-21') 
ORDER BY created_ts DESC LIMIT 100;

Upvotes: 4

MySQL inefficient query on Large data set

Answers (5)

Related Questions