Reputation: 2353
I have a table with a few million rows in it, and I am querying it. I want to know if I can optimize the query by adding indexes or anything else.
Table Schema:
CREATE TABLE `aggregate_data` (
`impressions` int(10) unsigned NOT NULL,
`clicks` int(10) unsigned NOT NULL,
`leads` int(10) unsigned NOT NULL,
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`country` varchar(2) COLLATE utf8_bin NOT NULL,
`campaign_id` int(10) unsigned NOT NULL,
`payout` decimal(12,6) NOT NULL,
`revenue` decimal(12,6) NOT NULL,
`creative_id` int(10) unsigned NOT NULL DEFAULT '0',
`advertiser_id` int(11) unsigned NOT NULL DEFAULT '0',
`offer_id` int(11) unsigned NOT NULL DEFAULT '0',
`affiliate_id` int(11) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`country`,`campaign_id`,`date`),
KEY `date_added` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
SQL Query:
SELECT
DATE_FORMAT(`date`, "%Y-%m-01 00:00:00") AS `date`,
offer_id,
country,
@sum_impressions := SUM(impressions),
@sum_clicks := SUM(clicks),
@sum_leads := SUM(leads),
@sum_payout := SUM(payout),
@sum_revenue := SUM(revenue)
FROM aggregate_data
WHERE `date` >= '2012-12-01 00:00:00'
GROUP BY country, offer_id, MONTH(`date`), YEAR(`date`)
When I run an EXPLAIN, it always tells me it uses all the rows in the table.
+----+-------------+----------------+------+---------------+------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+---------------+------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | aggregate_data | ALL | date_added | NULL | NULL | NULL | 809715 | Using where; Using temporary; Using filesort |
+----+-------------+----------------+------+---------------+------+---------+------+--------+----------------------------------------------+
It uses "Using where" because of the WHERE clause, "Using temporary" because of the group by and the filesort because of the group by also (i think).
Now what other indexes or something should i be adding so optimize this query. As the rows get large it takes a few seconds to run.
The variables like "@sum_impressions" are used because this SELECT statement is part of an "INSERT INTO ... ON DUPLICATE KEY UPDATE" statement.
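The full statement is shaped roughly like this (the "monthly_summary" target table here is a stand-in for the real one):
INSERT INTO monthly_summary
    (`date`, offer_id, country, impressions, clicks, leads, payout, revenue)
SELECT
    DATE_FORMAT(`date`, "%Y-%m-01 00:00:00") AS `date`,
    offer_id,
    country,
    @sum_impressions := SUM(impressions),
    @sum_clicks := SUM(clicks),
    @sum_leads := SUM(leads),
    @sum_payout := SUM(payout),
    @sum_revenue := SUM(revenue)
FROM aggregate_data
WHERE `date` >= '2012-12-01 00:00:00'
GROUP BY country, offer_id, MONTH(`date`), YEAR(`date`)
ON DUPLICATE KEY UPDATE
    impressions = @sum_impressions,
    clicks = @sum_clicks,
    leads = @sum_leads,
    payout = @sum_payout,
    revenue = @sum_revenue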
Upvotes: 3
Views: 93
Reputation: 1269803
Hmmm, this is a case where the index used for the where clause might be making things worse. Indexes can make things worse, because the query scans the index and then reads the original data, but out of sequence. If the data is larger than memory, then you might get unnecessary cache misses.
One solution is to partition your data by date.
One idea is to put an index on date, country, and offer_id (one index, three parts).
I'm not sure if this will work. It solves the where problem but only half the group by.
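In MySQL terms, that would be something like (the index name is arbitrary):
ALTER TABLE aggregate_data
    ADD INDEX date_country_offer (`date`, country, offer_id);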
If the year and month were separate columns, and the query looked like "where year >= 2012" and "group by country, offer_id, month, year", then an index on (year, month, country, offer_id) could satisfy the where and group by using just the index. I'm not sure what happens when mixing different granularities of date. That leads back to partitioning by date (probably at the month level) and then indexing on country, offer_id, and date. (Some databases actually support functional indexes, where you can use year(date) in the index.)
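As a rough sketch of the separate-column variant (the column and index names are just placeholders):
ALTER TABLE aggregate_data
    ADD COLUMN `year` smallint unsigned NOT NULL,
    ADD COLUMN `month` tinyint unsigned NOT NULL,
    ADD INDEX year_month_country_offer (`year`, `month`, country, offer_id);

SELECT `year`, `month`, country, offer_id,
       SUM(impressions), SUM(clicks), SUM(leads), SUM(payout), SUM(revenue)
FROM aggregate_data
WHERE `year` >= 2012
GROUP BY country, offer_id, `month`, `year`;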
Upvotes: 1
Reputation: 735
Just looking at the table structure: you're relying on the "(country, campaign_id, date)" primary key constraint to maintain uniqueness.
Upvotes: 1
Reputation: 57398
Several approaches are possible.
You could use RANGE partitioning to divide the table by year.
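A sketch of how that might look (partition names are arbitrary; MySQL wants UNIX_TIMESTAMP() here because your date column is a TIMESTAMP, while a DATETIME could use YEAR(date) directly):
ALTER TABLE aggregate_data
PARTITION BY RANGE (UNIX_TIMESTAMP(`date`)) (
    PARTITION p2012 VALUES LESS THAN (UNIX_TIMESTAMP('2013-01-01 00:00:00')),
    PARTITION p2013 VALUES LESS THAN (UNIX_TIMESTAMP('2014-01-01 00:00:00')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);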
You could run a batch and store each month's totals, then run the query on the totals table. Seeing as aggregate_data seems, from its name, to already be batch-generated, it might not be too difficult or expensive.
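A sketch of such a totals table (names and types are illustrative; your existing INSERT ... ON DUPLICATE KEY UPDATE batch could target it):
CREATE TABLE monthly_totals (
    `month_start` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
    country varchar(2) COLLATE utf8_bin NOT NULL,
    offer_id int(11) unsigned NOT NULL,
    impressions bigint unsigned NOT NULL DEFAULT 0,
    clicks bigint unsigned NOT NULL DEFAULT 0,
    leads bigint unsigned NOT NULL DEFAULT 0,
    payout decimal(16,6) NOT NULL DEFAULT 0,
    revenue decimal(16,6) NOT NULL DEFAULT 0,
    PRIMARY KEY (country, offer_id, `month_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;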
You could try indexing on (date, country, offer_id), but I think you might get better results (and waste much more disk space, as well as INSERT time) if you indexed on (date, country, offer_id, impressions, clicks, leads, payout, revenue). This way, all of the data the query requires is already in the index, and no access to the data table is needed; of course, this comes at a price in INSERT performance. By splitting date into year and month you could achieve even better results, select-performance-wise.
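A sketch of that "super-index" (again, the name is arbitrary):
ALTER TABLE aggregate_data
    ADD INDEX covering_idx (`date`, country, offer_id,
        impressions, clicks, leads, payout, revenue);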
If I had your problem, I'd test the super-index performance first (but give the whole app a good shakedown; different parts might experience a different performance impact); then I'd try to go with the batch solution, even if it means having two tables and handling synchronization.
Upvotes: 1