Reputation: 37381

Optimizing GROUP BY performance - Adds roughly five seconds to query

I'm running a wide variety of profiling on a large database that's typical for users of our application. It stores millions of records and I've spent time making sure the field types are what they need to be (although we could also normalize and move three of these columns to foreign keys).

The issue that by default, queries group by related information and count repetitive records. This group by kills us - taking a query that runs in 0.08 seconds and slows it down to 5.89 on average.

An example query:

SELECT player, x, y, z, COUNT(id), action_type
FROM prism_actions WHERE world = 'world'
AND (prism_actions.x BETWEEN -1119.650147217701 AND -919.650147217701)
AND (prism_actions.y BETWEEN -33.0 AND 167.0)
AND (prism_actions.z BETWEEN 385.14867792476133 AND 585.1486779247614) 
AND prism_actions.action_time >= '2013-01-31 17:09:16'
GROUP BY prism_actions.block_id 
LIMIT 1000;

I've tried a wide variety of different queries our app might use and group by is one of largest performance hits.

Our current table structure:

CREATE TABLE IF NOT EXISTS `prism_actions` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `action_time` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `action_type` varchar(25) NOT NULL,
  `player` varchar(16) NOT NULL,
  `world` varchar(255) NOT NULL,
  `x` int(11) NOT NULL,
  `y` int(11) NOT NULL,
  `z` int(11) NOT NULL,
  `block_id` mediumint(5) default NULL,
  `block_subid` mediumint(5) default NULL,
  `old_block_id` mediumint(5) default NULL,
  `old_block_subid` mediumint(5) default NULL,
  `data` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `x` (`x`),
  KEY `action_type` (`action_type`),
  KEY `player` (`player`),
  KEY `block_id` (`block_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=44525743 ;

We normally group by three fields but those don't impact performance much. I've tried playing with indexes (some have recommended combined indexes for other issues we've had, but there are so many variations of queries a user might request of the app - no way to really know which fields they'll be using).

How can I improve the performance of the grouping?

Upvotes: 5

Answers (4)

donramos

Reputation: 534

Try the following changes:

create a composite index that contains world, block_id, x, y, z and action time
make block_id "not null"
for x, y, z condition, round up the values since x, y, z are integers anyway

Upvotes: 1

Neo

Reputation: 5463

i run some script on my computer, i found that this query has 2 situation

first

you should caculate how many different value of world column, because there is a equal clause there, if you hava lots of different value of world, you can add index on world and block_id

alter table prism_actions add index world_block_id(world, block_id)

it will use this index for group by query.

second

add index to (x, y, z)

there would be lots of results be filtered, it does not cost much

EDIT:

may by you should rewrite your query like this

explain select player, x, y, z, count(a.id), action_type from prism_actions a inner join (select id from prism_actions
where (x BETWEEN -1119.650147217701 AND -919.650147217701)
AND (y BETWEEN -33.0 AND 167.0)
AND (z BETWEEN 385.14867792476133 AND 585.1486779247614)
AND action_time >= '2013-01-31 17:09:16') b on a.id=b.id group by a.block_id;

add index x_y_z

alert table prism_actions add index x_y_x(x, y, z);

Upvotes: 0

O. Jones

Reputation: 108651

You're using a nonstandard feature of MySQL in your query. Read this: http://dev.mysql.com/doc/refman/5.5/en/group-by-extensions.html

This feature is harmful on large tables, because it allows a programmer to conceal his intent from the MySQL engine. It's not clear what results you're going to get with all those items you've omitted from your GROUP BY clause.

Also, COUNT(*) is a little faster than COUNT(id), because it needs no logic to scrub out null values of id. Your table doesn't have them anyway, but still.

You probably should try to work with standard SQL as you optimize your query. Try this:

  SELECT block_id, player, 
         x, y, z, 
         COUNT(*) action_type
    FROM prism_actions 
   WHERE world = 'world'
     AND prism_actions.x BETWEEN -1119.650147217701 AND -919.650147217701
     AND prism_actions.y BETWEEN -33.0 AND 167.0
     AND prism_actions.z BETWEEN 385.14867792476133 AND 585.1486779247614
     AND prism_actions.action_time >= '2013-01-31 17:09:16'
GROUP BY block_id, player, x, y, z 
   LIMIT 1000

It will get you one row for every distinct value of block_id, player, and position.

But, it's possible you want this instead:

  SELECT block_id, player, 
         AVG(x) x, AVG(y) y, AVG(z) z, 
         COUNT(*) action_type
    FROM prism_actions 
   WHERE world = 'world'
     AND prism_actions.x BETWEEN -1119.650147217701 AND -919.650147217701
     AND prism_actions.y BETWEEN -33.0              AND 167.0
     AND prism_actions.z BETWEEN 385.14867792476133 AND 585.1486779247614 
     AND prism_actions.action_time >= '2013-01-31 17:09:16'
GROUP BY block_id, player 
   LIMIT 1000

This will give you a row for every matching block_id and player with the count of records and the average of x,y,z. You could also use MIN and MAX to get a bounding cube rather than an average position.

Assuming that you're also grouping by action_type (per your comment) try this:

  SELECT block_id, player, action_type 
         AVG(x) AS x, AVG(y) AS y, AVG(z) AS z, 
         COUNT(*) AS num
    FROM prism_actions 
   WHERE world = 'world'
     AND prism_actions.x BETWEEN -1119.650147217701 AND -919.650147217701
     AND prism_actions.y BETWEEN -33.0              AND 167.0
     AND prism_actions.z BETWEEN 385.14867792476133 AND 585.1486779247614 
     AND prism_actions.action_time >= '2013-01-31 17:09:16'
GROUP BY block_id, player, action_type 
   LIMIT 1000

Is there some reason you don't have indexes on y, z, and action_time? Your query seems to be very selective on those column values.

If you don't need the AVG(x) AS x, AVG(y) AS y, AVG(z) AS z columns in your result set, leave them out to save time.

What is the LIMIT clause doing for you? Keep in mind that the order of rows in a result set is, formally, unpredictable, so the LIMIT clause may select different data from time to time.

Upvotes: 0

Gordon Linoff

Reputation: 1269753

Assuming the same query performs well with three columns in the group by but not with one column, block_id, we can ask what is the difference?

The difference is that block_id has an index on it. You would think that indexes always improve performance, but this is not true. I would suggest that you remove the index on block_id and see if it improves your query performance.

What happens under the hood (assuming this fixes the performance problem) is that the index looks up a record, which then has to be fetched randomly from the table. The next record is probably located far away. Soon, the page cache is full and each read of a record requires reading a page from disk. A major performance bottleneck.

The sorting method does read the data, but more efficiently, because it reads all the records on a page at one time. So, removing the index might improve performance.

Upvotes: 0

Optimizing GROUP BY performance - Adds roughly five seconds to query

Answers (4)

Related Questions