Reputation: 37341
Trying to use indexes more efficiently on massive data.
I have an open source application that logs millions of records to a MySQL database. I've used MySQL databases for years in web development, and I understand enough about choosing efficient field types, the basics of why/how indexes are useful, etc. - but the sheer volume of data our application logs, combined with the fact that it's hard to predict exactly which columns will be queried, has me a bit under water.
The application logs events by players. We have a very advanced purge system but some servers are so busy, they have 50 million records after just eight weeks.
At that size, even with our existing indexes, queries may still take 30-90 seconds.
The primary table schema (minus existing indexes):
CREATE TABLE IF NOT EXISTS `prism_data` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`epoch` int(10) unsigned NOT NULL,
`action_id` int(10) unsigned NOT NULL,
`player_id` int(10) unsigned NOT NULL,
`world_id` int(10) unsigned NOT NULL,
`x` int(11) NOT NULL,
`y` int(11) NOT NULL,
`z` int(11) NOT NULL,
`block_id` mediumint(5) DEFAULT NULL,
`block_subid` mediumint(5) DEFAULT NULL,
`old_block_id` mediumint(5) DEFAULT NULL,
`old_block_subid` mediumint(5) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
WHERE conditions most often include:
- world_id/x/y/z coordinates (queries all default to a radius around the user, so coordinates are almost always used)
- epoch (all queries default to the last three days; users need to override this for longer timeframes)
- action_id and/or player_id (half the time, users are looking for who did specific actions, or for what actions a specific player caused)
GROUP BY - by default the application groups by certain fields so that the user doesn't see 100 duplicate events for the same player/action/block; they just see a single record with a count. The grouping columns are action_id, player_id, block_id, DATE(FROM_UNIXTIME(epoch)).
ORDER BY is always prism_data.epoch DESC, x ASC, z ASC, y ASC, id DESC. The epoch is so that the user sees the most recent events first. The rest are so that a "rollback" engine gets things in the right order.
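Putting those defaults together, a typical grouped query might look like the following sketch. The WHERE values (world id, coordinates, time cutoff) are illustrative only, and the ORDER BY of non-grouped columns relies on MySQL's permissive, pre-ONLY_FULL_GROUP_BY handling of GROUP BY:

```sql
-- Hypothetical sketch of the default grouped query described above.
-- All literal values are made up; column names come from the schema.
SELECT action_id,
       player_id,
       block_id,
       DATE(FROM_UNIXTIME(epoch)) AS day,
       COUNT(*) AS event_count
FROM prism_data
WHERE world_id = 1
  AND (x BETWEEN 668 AND 868)
  AND (y BETWEEN -33 AND 167)
  AND (z BETWEEN 358 AND 558)
  AND epoch >= UNIX_TIMESTAMP() - 3*24*60*60  -- default: last three days
GROUP BY action_id, player_id, block_id, DATE(FROM_UNIXTIME(epoch))
ORDER BY epoch DESC, x ASC, z ASC, y ASC, id DESC
LIMIT 1000;
```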
Here is an example query without order/group:
SELECT *
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
INNER JOIN prism_worlds w ON w.world_id = prism_data.world_id
LEFT JOIN prism_data_extra ex ON ex.data_id = prism_data.id
WHERE w.world = 'DeuxTiersMondes'
AND (prism_data.x BETWEEN 668 AND 868)
AND (prism_data.y BETWEEN -33 AND 167)
AND (prism_data.z BETWEEN 358 AND 558)
LIMIT 1000;
Using an index:
INDEX location (world_id, x, z, y)
it still takes 15 seconds to find 1000 rows (or 50 seconds to find all 64735).
The explain for that query:
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
| 1 | SIMPLE | w | ref | PRIMARY,world | world | 767 | const | 1 | Using where; Using index |
| 1 | SIMPLE | prism_data | ref | location | location | 4 | minecraft.w.world_id | 6155 | Using index condition |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.action_id | 1 | NULL |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.player_id | 1 | NULL |
| 1 | SIMPLE | ex | ref | data_id | data_id | 4 | minecraft.prism_data.id | 1 | NULL |
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
It just seems to me that looking for values this specific ought to be much faster. We're not even sorting/grouping in this query.
My questions:
I assume it makes the most sense to design an index for each of the common conditions I listed above, i.e. one index that combines world_id/x/y/z, one that combines action_id/player_id, and one for epoch. For certain queries this works well, but for others it doesn't. For a query that used world_id, player_id, and epoch, MySQL chose only the world_id/x/y/z index. Should I instead add a combined index on world_id/player_id/epoch? I can't really tell what logic MySQL uses to choose which index fits best, but I assume that if an index contains more of the columns MySQL needs, it'll choose that one. A slight perf hit on write is worth it if that will help my queries.
Queries that sort/group end up Using filesort, which I know is a main pain point for performance.
Sorry for the long read.
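For reference, the candidate indexes I'm weighing could be written as follows (a sketch; the index names are made up, and each query will pick at most one of them):

```sql
-- Hypothetical candidate indexes for the access patterns listed above.
ALTER TABLE prism_data
  ADD INDEX location (world_id, x, z, y),          -- radius lookups
  ADD INDEX actor    (action_id, player_id),       -- who-did-what filters
  ADD INDEX epoch    (epoch),                      -- time-window filters
  ADD INDEX combined (world_id, player_id, epoch); -- candidate for the mixed query
```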
I'm doing a lot of profiling for 5 of our most common queries with different index setups but have a feeling that I may be missing some basics. I'd rather have some true experts school me in something I'm missing before I continue.
Upvotes: 4
Views: 470
Reputation: 108641
MySQL (and other RDBMS systems) makes good use of covering indexes. So, if you're looking up, to use your example,
SELECT prism_data.id,
prism_data.action_id,
prism_data.world_id
FROM prism_data
INNER JOIN prism_worlds w ON w.world_id = prism_data.world_id
WHERE w.world = 'DeuxTiersMondes'
AND (prism_data.x BETWEEN 668 AND 868)
AND (prism_data.y BETWEEN -33 AND 167)
AND (prism_data.z BETWEEN 358 AND 558)
ORDER BY prism_data.id DESC
LIMIT 1000;
The following BTREE index on prism_data will probably help a bunch with query performance (almost all indexes are BTREE indexes):
(world_id, x, y, z, id, action_id)
The whole of this query on prism_data can be satisfied just from the index. It's called a covering index because the server can find everything it needs to satisfy -- to cover -- the query in the index, and so doesn't have to bounce over to the data table itself. It'll do an index identity scan on world_id, then a range scan on x, and then look at the y, and z values for matching the rest of your query. It will then pull out the id values, order them, and return the LIMIT 1000 partial result set.
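As a concrete sketch, that covering index could be created like this (the index name is made up):

```sql
-- Hypothetical covering index: every column the query above reads
-- (world_id, x, y, z, id, action_id) lives in the index itself,
-- so MySQL never has to visit the table rows.
ALTER TABLE prism_data
  ADD INDEX covering_location (world_id, x, y, z, id, action_id);
```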
You should absolutely stop using SELECT *. When you say SELECT * you deny MySQL any knowledge of what columns of data you actually need, so you defeat the optimizer's logic that chooses covering index queries over raw table queries.
If your data are fairly evenly distributed over x and y, and you can use MyISAM, you may want to look into using geospatial indexes. These do a better job of random-accessing x/y ranges than ordinary indexes.
Elaborate index setups do slow down insertion and update; it's definitely a tradeoff.
Upvotes: 1
Reputation: 16475
Just a quick note, because this sort of thing is seen over and over again: the JOIN on prism_worlds is unneeded, because you (most probably) don't need the data from that table. You are basically asking the database "give me every world whose name is equal to 'something'". Use a scalar subquery instead.
Create a unique index on prism_worlds.world and run the query like
SELECT *
FROM prism_data
WHERE prism_data.world_id = (SELECT w.world_id FROM prism_worlds AS w WHERE w.world = 'DeuxTiersMondes')
LIMIT 1000;
The optimizer will figure out that prism_data.world_id is constrained to a single constant value. MySQL will run the subquery ahead of time to figure out this value and use it throughout the query; see EXPLAIN for the const subquery executed.
Regarding prism_data.x, .y and .z: you may want to create a geometry column and a spatial index for that. If you need to stick to separate values, you may want to divide the entire world geometry into voxels of fixed size (represented by a single int) and use simple geometry to figure out which position falls into which voxel.
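A minimal sketch of the voxel idea, assuming 16-block voxels and a hypothetical voxel_id column. This uses MySQL 5.7+ generated-column syntax; on older versions the application would compute the voxel id at insert time, and both the voxel size and the packing scheme here are made-up choices:

```sql
-- Hypothetical: derive a single int voxel id from x/y/z and index it.
-- Offsets and multipliers assume coordinates stay within +/- 65536.
ALTER TABLE prism_data
  ADD COLUMN voxel_id BIGINT AS
    (((x DIV 16) + 4096) * 8192 * 8192
     + ((y DIV 16) + 4096) * 8192
     + ((z DIV 16) + 4096)) STORED,
  ADD INDEX idx_voxel (world_id, voxel_id);
```

A radius query then becomes an IN-list of the handful of voxel ids the radius overlaps, which is a set of point lookups instead of three stacked range scans.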
My personal solution would be not to give too much thought to adding gazillions of indexes on this table. The indexes will make it slow and big. Use a cron job to fill a reporting table (materialized view) to produce the results ahead of time, and use them until the cron job comes around and updates them again.
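A sketch of such a reporting table and its refresh query; all names and the choice of grouping columns here are hypothetical:

```sql
-- Hypothetical reporting table refreshed periodically by a cron job.
CREATE TABLE IF NOT EXISTS prism_daily_summary (
  day         DATE NOT NULL,
  world_id    INT UNSIGNED NOT NULL,
  action_id   INT UNSIGNED NOT NULL,
  player_id   INT UNSIGNED NOT NULL,
  event_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (day, world_id, action_id, player_id)
);

-- Each run recomputes the aggregates and overwrites the old rows.
REPLACE INTO prism_daily_summary
SELECT DATE(FROM_UNIXTIME(epoch)), world_id, action_id, player_id, COUNT(*)
FROM prism_data
GROUP BY DATE(FROM_UNIXTIME(epoch)), world_id, action_id, player_id;
```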
Upvotes: 1
Reputation: 615
MySQL can use compound indexes if provided with the first n columns of the index. So if you have a compound index on columns a, b, c, d, then MySQL can use that index if you provide columns a, b. OTOH, MySQL could not use the index if you only provided columns b, c, d in your query. Depending on the combinations of which columns you may use in your query, it may or may not make sense to include columns in multiple indexes. Don't forget that there's an additional cost to insert/update/delete a row for each index.
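The leftmost-prefix rule can be sketched like this, on a hypothetical table t with a compound index on (a, b, c, d):

```sql
-- Hypothetical compound index illustrating leftmost-prefix use.
CREATE INDEX idx_abcd ON t (a, b, c, d);

SELECT * FROM t WHERE a = 1 AND b = 2;           -- can use idx_abcd (prefix a, b)
SELECT * FROM t WHERE b = 2 AND c = 3 AND d = 4; -- cannot use idx_abcd (a missing)
```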
I don't think MySQL has ordered indexes so I suspect that indexing will not help with ordering performance but I'm not sure.
Depending on how you use your data it may make sense to investigate partitioning the table, perhaps by epoch.
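Partitioning by epoch could be sketched like this; the partition names and boundary timestamps are made-up example values:

```sql
-- Hypothetical range partitioning on epoch. When a query's WHERE clause
-- constrains epoch, MySQL can prune the partitions it does not need to read.
ALTER TABLE prism_data
PARTITION BY RANGE (epoch) (
  PARTITION p2014q1 VALUES LESS THAN (1396310400),
  PARTITION p2014q2 VALUES LESS THAN (1404172800),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```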
Upvotes: 0