user2527658
user2527658

Reputation: 31

Mysql Partitioning effect on Index

I have multiple big tables for business data with smallest one having 38million rows(24G data, 26G index size). I have indexes setup to speed up the lookups and buffer pool set to 80% of total RAM(116G). Even after these settings, over time we have started observing performance issues. I have constraints with the disk size(1T) and sharding is not an option currently. The data growth has increased to 0.5M rows per day. This is leading to frequent optimisation and master switch exercises. Table schemas and indexes have already been optimised. Hence, I have started looking at partitioning the table to improve performance. My primary partitioning use case is to delete data on monthly basis by dropping partitions so that optimisations are not required and read/write latencies are improved. Following is the structure for one of the big tables(column names have been changed for legal reasons - assume that the columns where indexes are defined have lookup use cases):

   CREATE TABLE `table_name` (
     `id` int(11) NOT NULL AUTO_INCREMENT,
     `data_1` int(11) NOT NULL,
     `data_2` varchar(40) COLLATE utf8_unicode_ci NOT NULL,
     `data_3` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
     `data_4` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
     `created_at` datetime DEFAULT NULL,
     `updated_at` datetime DEFAULT NULL,
     PRIMARY KEY (`id`),
     KEY `index_data1` (`data_1`),
     KEY `index_data2` (`data_2`)
   ) ENGINE=InnoDB AUTO_INCREMENT=100572 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

I am planning to partition on the created_at column. However, the problem is that the partitioning column has to be part of all the unique keys. I can add the created_at column to the primary key but that would lead to increase in index size which in turn has its own side effects. Is there some workaround or any better solution?

Apart from solving this problem, there are few more questions whose answers couldn't be found in any documentation or articles present. 1. Why does mysql warrant partitioning column to be part of unique key? 2. The queries from the ORM don't have created_at clause present that means pruning is not possible with reads which we were okay with provided inserts are always pruned. However, doesn't look like this is the case. Why does mysql open all the partitions for inserts?

Mysql Version - 5.6.33-79.0-log Percona Server (GPL), Release 79.0, Revision 2084bdb

Upvotes: 2

Views: 2369

Answers (1)

Rick James
Rick James

Reputation: 142538

PRIMARY KEY(id, created_at) will take only an tiny bit more space than PRIMARY KEY(id). I estimate it at much less than 1% for your data. I can't tell about the index space -- can you show us the non-primary index(es)?

Explanation: The leaf nodes of the data (which is a BTree organized by the PK), will not change in size. The non-leaf nodes will have created_at added to each 'row'. As a rule of thumb in InnoDB, non-leaf nodes take up about 1% of the space for the BTree.

For the INDEX BTrees, the leaf nodes need an extra 4 bytes/row for created_at unless created_at is already in the index.

Let's say you currently have INDEX(foo) where foo is INT and id is also INT. That's a total of 8 bytes (plus overhead). Adding created_at (a 4-byte TIMESTAMP) expands each leaf 'row' to 12+overhead. So, that index may double in size.

A guess: Your 24G+26G might grow to 25G+33G.

It sounds like you have several indexes. You do understand that INDEX(a) is not useful if you also have INDEX(a,b)? And that INDEX(x,y) is a lot better than INDEX(x), INDEX(y) in some situations? Let's discuss your indexes.

The main benefit for PARTITIONing is your use case -- DROP PARTITION is a lot faster than DELETE. My blog on such.

Don't be lulled by partitioning. You are hoping for "read/write latencies are improved"; such is not likely to happen. If you would like further explanation please provide a SELECT where you think it might happen.

How many "months" will you partition on? I recommend not more than 50. PARTITIONing has some inefficiencies when there are lots of partitions.

Because of the need for the partition key to be in UNIQUE keys, the uniqueness constraint is almost totally useless. Having it on the end of an AUTO_INCREMENT id is not an issue.

Consider whether something other than id can be the PK.

Question 1: When INSERTing a row, all UNIQUE keys are immediately checked for "dup key". Without the partition key being part of the unique key, this would mean probing every partition. This is too costly to contemplate; so it was not done. (In the future, a 'global-to-the-table' UNIQUE key may be implemented. Version 8.0 has some hooks for such.)

Question 2a: Yes, if the SELECT's WHERE does not adequately specify the partition key, all partitions will be opened and looked at. This is another reason to minimize the number of partitions. Hmmm... If you do a SELECT on the 31st of the month and do the same SELECT the next day, you could get fewer rows (even without any deletes, just the DROP PARTITION); this seems "wrong".

Question 2b: "Why does mysql open all the partitions for inserts?" -- What makes you think it does? There is an odd case where the "first" partition is 'unnecessarily' opened -- the partition key is DATETIME.

Upvotes: 5

Related Questions