Reputation: 2353
I have a table with a few million rows in it, and I am querying it. I want to know if I can optimize the query by adding indexes or anything else.
Table Schema:
CREATE TABLE `aggregate_data` (
`impressions` int(10) unsigned NOT NULL,
`clicks` int(10) unsigned NOT NULL,
`leads` int(10) unsigned NOT NULL,
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`country` varchar(2) COLLATE utf8_bin NOT NULL,
`campaign_id` int(10) unsigned NOT NULL,
`payout` decimal(12,6) NOT NULL,
`revenue` decimal(12,6) NOT NULL,
`creative_id` int(10) unsigned NOT NULL DEFAULT '0',
`advertiser_id` int(11) unsigned NOT NULL DEFAULT '0',
`offer_id` int(11) unsigned NOT NULL DEFAULT '0',
`affiliate_id` int(11) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`country`,`campaign_id`,`date`),
KEY `date_added` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
SQL Query:
SELECT
DATE_FORMAT(`date`, "%Y-%m-01 00:00:00") AS `date`,
offer_id,
country,
@sum_impressions := SUM(impressions),
@sum_clicks := SUM(clicks),
@sum_leads := SUM(leads),
@sum_payout := SUM(payout),
@sum_revenue := SUM(revenue)
FROM aggregate_data
WHERE `date` >= '2012-12-01 00:00:00'
GROUP BY country, offer_id, MONTH(`date`), YEAR(`date`)
When I run an EXPLAIN, it always tells me it uses all the rows in the table.
+----+-------------+----------------+------+---------------+------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+---------------+------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | aggregate_data | ALL | date_added | NULL | NULL | NULL | 809715 | Using where; Using temporary; Using filesort |
+----+-------------+----------------+------+---------------+------+---------+------+--------+----------------------------------------------+
It uses "Using where" because of the WHERE clause, "Using temporary" because of the group by and the filesort because of the group by also (i think).
Now what other indexes or something should i be adding so optimize this query. As the rows get large it takes a few seconds to run.
The variables like "@sum_impressions" are used because this SELECT statement is part of an "INSERT INTO ... ON DUPLICATE KEY UPDATE" statement.
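The full statement is shaped roughly like this (the "monthly_summary" target table here is a stand-in for the real one):
INSERT INTO monthly_summary
    (`date`, offer_id, country, impressions, clicks, leads, payout, revenue)
SELECT
    DATE_FORMAT(`date`, "%Y-%m-01 00:00:00") AS `date`,
    offer_id,
    country,
    @sum_impressions := SUM(impressions),
    @sum_clicks := SUM(clicks),
    @sum_leads := SUM(leads),
    @sum_payout := SUM(payout),
    @sum_revenue := SUM(revenue)
FROM aggregate_data
WHERE `date` >= '2012-12-01 00:00:00'
GROUP BY country, offer_id, MONTH(`date`), YEAR(`date`)
ON DUPLICATE KEY UPDATE
    impressions = @sum_impressions,
    clicks = @sum_clicks,
    leads = @sum_leads,
    payout = @sum_payout,
    revenue = @sum_revenue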
Upvotes: 3
Views: 93
Reputation: 1269803
Hmmm, this is a case where the index used for the where clause might be making things worse. Indexes can make things worse, because the query scans the index and then reads the original data, but out of sequence. If the data is larger than memory, then you might get unnecessary cache misses.
One solution is to partition your data by date.
One idea is to put an index on date, country, and offer_id (one index, three parts).
I'm not sure if this will work. It solves the where problem but only half the group by.
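In MySQL terms, that would be something like (the index name is arbitrary):
ALTER TABLE aggregate_data
    ADD INDEX date_country_offer (`date`, country, offer_id);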
If the year and month were separate columns, and the query looked like "where year >= 2012" and "group by country, offer_id, month, year", then an index on (year, month, country, offer_id) could satisfy the where and group by using just the index. I'm not sure what happens when mixing different granularities of date. That leads back to partitioning by date (probably at the month level) and then indexing on country, offer_id, and date. (Some databases actually support functional indexes, where you can use year(date) in the index.)
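As a rough sketch of the separate-column variant (the column and index names are just placeholders):
ALTER TABLE aggregate_data
    ADD COLUMN `year` smallint unsigned NOT NULL,
    ADD COLUMN `month` tinyint unsigned NOT NULL,
    ADD INDEX year_month_country_offer (`year`, `month`, country, offer_id);

SELECT `year`, `month`, country, offer_id,
       SUM(impressions), SUM(clicks), SUM(leads), SUM(payout), SUM(revenue)
FROM aggregate_data
WHERE `year` >= 2012
GROUP BY country, offer_id, `month`, `year`;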
Upvotes: 1
Reputation: 735
Just looking at the table structure: you're relying on the "(country, campaign_id, date)" primary key constraint to maintain uniqueness.
Upvotes: 1
Reputation: 57398
Several approaches are possible.
You could use RANGE partitioning to divide the table by year.
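A sketch of how that might look (partition names are arbitrary; MySQL wants UNIX_TIMESTAMP() here because your date column is a TIMESTAMP, while a DATETIME could use YEAR(date) directly):
ALTER TABLE aggregate_data
PARTITION BY RANGE (UNIX_TIMESTAMP(`date`)) (
    PARTITION p2012 VALUES LESS THAN (UNIX_TIMESTAMP('2013-01-01 00:00:00')),
    PARTITION p2013 VALUES LESS THAN (UNIX_TIMESTAMP('2014-01-01 00:00:00')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);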
You could run a batch and store each month's totals, then run the query on the totals table. Seeing as aggregate_data seems, from its name, to already be batch-generated, it might not be too difficult or expensive.
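A sketch of such a totals table (names and types are illustrative; your existing INSERT ... ON DUPLICATE KEY UPDATE batch could target it):
CREATE TABLE monthly_totals (
    `month_start` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
    country varchar(2) COLLATE utf8_bin NOT NULL,
    offer_id int(11) unsigned NOT NULL,
    impressions bigint unsigned NOT NULL DEFAULT 0,
    clicks bigint unsigned NOT NULL DEFAULT 0,
    leads bigint unsigned NOT NULL DEFAULT 0,
    payout decimal(16,6) NOT NULL DEFAULT 0,
    revenue decimal(16,6) NOT NULL DEFAULT 0,
    PRIMARY KEY (country, offer_id, `month_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;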
You could try indexing on (date, country, offer_id), but I think you might get better results (and waste much more disk space, as well as INSERT time) if you indexed on (date, country, offer_id, impressions, clicks, leads, payout, revenue). This way, all of the data the query requires is already in the index, and no access to the data table is needed; of course, this comes at a price in INSERT performance. By splitting date into year and month you could achieve even better results, select-performance-wise.
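A sketch of that "super-index" (again, the name is arbitrary):
ALTER TABLE aggregate_data
    ADD INDEX covering_idx (`date`, country, offer_id,
        impressions, clicks, leads, payout, revenue);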
If I had your problem, I'd test the super-index performance first (but give the whole app a good shakedown; different parts might experience a different performance impact); then I'd try to go with the batch solution, even if it means having two tables and handling synchronization.
Upvotes: 1