Gabe Harriman

Reputation: 75

How to optimize a query that's using group by on a large number of rows

The table looks like this:

    CREATE TABLE `tweet_tweet` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `text` varchar(256) NOT NULL,
      `created_at` datetime NOT NULL,
      `created_date` date NOT NULL,
...
      `positive_sentiment` decimal(5,2) DEFAULT NULL,
      `negative_sentiment` decimal(5,2) DEFAULT NULL,
      `entity_id` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
    ) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8

The explain on the query looks like this:

mysql> explain SELECT `tweet_tweet`.`entity_id`, 
       STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`, 
       AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`, 
       COUNT(`tweet_tweet`.`id`) AS `tweet_count` 
       FROM `tweet_tweet` 
       WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'  
       GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;

+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows    | Extra                                        |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | tweet_tweet | ALL  | NULL          | NULL | NULL    | NULL | 1097452 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
  1 row in set (0.00 sec)

About 300k rows are added to the table every day. The query currently runs in about 4 seconds, but I want to get it down to around 1 second, and I'm afraid it will take longer and longer as the days go on. The total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.

Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)

Upvotes: 2

Views: 137

Answers (4)

Baris Akverdi

Reputation: 11

If your MySQL version is 5.1 or higher, you can consider the partitioning option for large tables.

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
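
As a rough sketch of what that could look like for this table (partition names and boundaries are illustrative, not taken from the question), note that MySQL requires the partitioning column to appear in every unique key, so the primary key would have to be widened first:

```sql
-- Illustrative sketch only: widen the primary key so it includes the
-- partitioning column, as MySQL's partitioning rules require.
ALTER TABLE tweet_tweet
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, created_date);

-- Range-partition by month so date-bounded queries can prune partitions.
ALTER TABLE tweet_tweet
  PARTITION BY RANGE (TO_DAYS(created_date)) (
    PARTITION p201009 VALUES LESS THAN (TO_DAYS('2010-10-01')),
    PARTITION p201010 VALUES LESS THAN (TO_DAYS('2010-11-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );
```

One caveat: the query in the question filters on created_at, so partition pruning would only kick in if the query also constrains created_date (or the table is partitioned on created_at instead).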

Upvotes: 0

cryo28

Reputation: 1127

You may try reordering the fields in the index (i.e. KEY tweet_tweet_entity_created (created_at, entity_id)). That will allow MySQL to use the index to reduce the number of rows that actually need to be grouped and ordered.
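
A sketch of that change (the new index name is my own choice, not from the question):

```sql
-- Drop the old (entity_id, created_at) index and add one with
-- created_at leading, so the range predicate on created_at can use it.
ALTER TABLE tweet_tweet
  DROP KEY tweet_tweet_entity_created,
  ADD KEY tweet_tweet_created_entity (created_at, entity_id);
```

With created_at leading, the GROUP BY still needs a temporary table and sort, but only over the rows that pass the range scan rather than the whole table.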

Upvotes: 1

Wolfman2000

Reputation: 153

MySQL has a dirty little secret: a multi-column index is only used when the query filters on its leftmost column(s). Your index leads with entity_id, but the WHERE clause filters on created_at, so the index can't be used for the range condition. I've built tables with unique keys and foreign keys, and I often had to add a separate index for one or more of the columns anyway.

I suggest adding a separate index on created_at at a minimum. I don't know whether indexing the aggregated columns will also speed things up.

Upvotes: 0

William

Reputation: 15593

You're not using the index tweet_tweet_entity_created. Change your query to:

explain SELECT `tweet_tweet`.`entity_id`, 
       STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`, 
       AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`, 
       COUNT(`tweet_tweet`.`id`) AS `tweet_count` 
       FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
       WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'  
       GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;

You can read more about index hints in the MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Sometimes MySQL's query optimizer needs a little help.

Upvotes: 0
