Reputation: 2126
I want to partition a very large table. As the business is growing, partitioning by date isn't really that good because each year the partitions get bigger and bigger. What I'd really like is a partition for every 10 million records.
The MySQL manual shows this simple example:
CREATE TABLE employees (
    id INT NOT NULL,
    fname VARCHAR(30),
    lname VARCHAR(30),
    hired DATE NOT NULL DEFAULT '1970-01-01',
    separated DATE NOT NULL DEFAULT '9999-12-31',
    job_code INT NOT NULL,
    store_id INT NOT NULL
)
PARTITION BY RANGE (store_id) (
    PARTITION p0 VALUES LESS THAN (6),
    PARTITION p1 VALUES LESS THAN (11),
    PARTITION p2 VALUES LESS THAN (16),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);
But this means that every row with store_id of 16 or higher gets thrown into the last partition. Is there a way to auto-generate a new partition at each interval (in my case, every 10 million records) so I won't have to keep modifying an active database? I am running MySQL 5.5.
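In other words, I'd like the effect of something like the following, but without adding the partitions by hand as the table grows. (This is only a sketch; the boundaries are illustrative, and MySQL 5.5 also requires every unique key to include the partitioning column, so the keys would need adjusting first.)

```sql
-- Sketch: 10M-row range partitions on the AUTO_INCREMENT column.
ALTER TABLE my_table
PARTITION BY RANGE (row_id) (
    PARTITION p0   VALUES LESS THAN (10000000),
    PARTITION p1   VALUES LESS THAN (20000000),
    PARTITION p2   VALUES LESS THAN (30000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
```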
Thanks!
EDIT: Here is my actual table
CREATE TABLE `my_table` (
    `row_id` int(11) NOT NULL AUTO_INCREMENT,
    `filename` varchar(50) DEFAULT NULL,
    `timestamp` datetime DEFAULT NULL,
    `unit_num` int(3) DEFAULT NULL,
    `string` int(3) DEFAULT NULL,
    `voltage` float(6,4) DEFAULT NULL,
    `impedance` float(6,4) DEFAULT NULL,
    `amb` float(6,2) DEFAULT NULL,
    `ripple_v` float(8,6) DEFAULT NULL,
    PRIMARY KEY (`row_id`),
    UNIQUE KEY `timestamp` (`timestamp`,`filename`,`string`,`unit_num`),
    KEY `index1` (`filename`),
    KEY `index2` (`timestamp`),
    KEY `index3` (`timestamp`,`filename`,`string`),
    KEY `index4` (`filename`,`unit_num`)
) ENGINE=MyISAM AUTO_INCREMENT=690892041 DEFAULT CHARSET=latin1
and an example query for the graph is...
SELECT DATE_FORMAT(timestamp,'%Y/%m/%d %H:%i:%s') as mytime, voltage, impedance, amb, ripple_v, unit_num
FROM my_table
WHERE timestamp >= DATE_SUB('2015-07-31 00:05:59', INTERVAL 90 DAY)
  AND filename = 'dlrphx10s320upsab3' AND unit_num = '5' AND string = '2'
ORDER BY timestamp ASC;
Here is the explain for the query...
mysql> explain SELECT DATE_FORMAT(timestamp,'%Y/%m/%d %H:%i:%s') as mytime,voltage,impedance,amb,ripple_v,unit_num FROM my_table WHERE timestamp >= DATE_SUB('2015-07-31 00:05:59', INTERVAL 90 DAY) AND filename = 'dlrphx10s320upsab3' and unit_num='5' and string='2' ORDER BY timestamp asc;
+----+-------------+------------+------+-------------------------+--------+---------+-------------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+-------------------------+--------+---------+-------------+-------+----------------------------------------------------+
| 1 | SIMPLE | unit_tarma | ref | timestamp,index3,index4 | index4 | 58 | const,const | 13440 | Using index condition; Using where; Using filesort |
+----+-------------+------------+------+-------------------------+--------+---------+-------------+-------+----------------------------------------------------+
Upvotes: 5
Views: 3752
Reputation: 142538
(This answer is directed at the schema and SELECT.)
Since you anticipate millions of rows, first I want to point out some improvements to the schema.
FLOAT(m,n) is usually the 'wrong' thing to do because it leads to two roundings. Either use plain FLOAT (which seems 'right' for metrics like voltage) or use DECIMAL(m,n). FLOAT is 4 bytes; in the cases given, DECIMAL would be 3 or 4 bytes.
When you have both INDEX(a) and INDEX(a,b), the former is unnecessary since the latter covers it. You have 3 unnecessary KEYs. This slows down INSERTs.
INT(3) -- Are you saying a "3-digit number"? If so, consider TINYINT UNSIGNED (values 0..255) for 1 byte instead of INT for 4 bytes. This will save many MB of disk space, hence speed. (See also SMALLINT, etc., and SIGNED or UNSIGNED.)
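For the table shown, those two type changes might look like the following. (A sketch only -- verify that the actual values fit in 0..255 before altering, and note that ALTER TABLE rebuilds a table this size.)

```sql
-- Sketch: shrink 4-byte INT(3) columns to 1-byte TINYINT UNSIGNED.
ALTER TABLE my_table
    MODIFY unit_num TINYINT UNSIGNED DEFAULT NULL,
    MODIFY string   TINYINT UNSIGNED DEFAULT NULL;
```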
If filename is repeated a lot, you may want to "normalize" it. This would save many MB.
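A minimal sketch of such normalization follows; the `filenames` table and `filename_id` column are hypothetical names, not anything in your schema:

```sql
-- Sketch: move repeated filename strings into a lookup table.
CREATE TABLE filenames (
    filename_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    filename    VARCHAR(50) NOT NULL,
    PRIMARY KEY (filename_id),
    UNIQUE KEY (filename)
);
-- my_table would then store the 2-byte filename_id instead of
-- the up-to-50-byte VARCHAR on every row.
```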
Use NOT NULL unless you need NULL for something.
AUTO_INCREMENT=690892041 implies that you are about 1/3 of the way to disaster with row_id, which will top out at about 2 billion. Do you use row_id for anything? Getting rid of the column would avoid the issue; then change the UNIQUE KEY to the PRIMARY KEY. (If you do need row_id, let's talk further.)
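A sketch of that change, assuming nothing references row_id (the PRIMARY KEY columns would also need to be made NOT NULL first, per the earlier advice):

```sql
-- Sketch: drop the nearly-exhausted AUTO_INCREMENT column and
-- promote the existing UNIQUE key to PRIMARY KEY.
ALTER TABLE my_table
    DROP COLUMN row_id,
    DROP INDEX `timestamp`,
    ADD PRIMARY KEY (timestamp, filename, string, unit_num);
```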
ENGINE=MyISAM -- Switching to InnoDB has some ramifications, both favorable and unfavorable. The table would become 2-3 times as big. The 'right' choice of PRIMARY KEY would further speed up this SELECT significantly. (And it may or may not slow down other SELECTs.)
A note on the SELECT: Since string and unit_num are constants in the query, the last two fields of ORDER BY timestamp asc, string asc, unit_num asc are unnecessary. If they are relevant for reasons not apparent in the SELECT, then my advice may be incomplete.
This
WHERE filename = 'foobar'
  AND unit_num = '40'
  AND string = '2'
  AND timestamp >= ...
is optimally handled by INDEX(filename, unit_num, string, timestamp). The order of the columns is not important except that timestamp needs to be last. Rearranging the current UNIQUE key gives you the optimal index. (Meanwhile, none of the existing indexes is very good for this SELECT.) Making it the PRIMARY KEY and the table InnoDB would make it even faster.
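If you keep MyISAM for now, just adding the composite index might look like this (a sketch; the index name is illustrative):

```sql
-- Sketch: composite index matching the WHERE clause; the range
-- column (timestamp) must come last.
ALTER TABLE my_table
    ADD INDEX file_unit_str_ts (filename, unit_num, string, timestamp);
```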
Partitioning? No advantage. Not for performance; not for anything else you have mentioned. A common use for partitioning is for purging 'old'. If you intend to do such, let's talk further.
In huge tables it is best to look at all the important SELECTs simultaneously so that we don't speed up one while demolishing the speed of others. It may even turn out that partitioning helps in this kind of tradeoff.
Upvotes: 4
Reputation: 142538
First, I must ask what benefit Partitioning gives you? Is there some query that runs faster because of it?
There is no auto-partitioning.
Instead, you should have a job that runs every day and counts the number of rows in the 'last active' partition to see if it is near 10M. If so, add another partition.
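Such a job could start from something like this (a sketch; 'mydb' is a placeholder, and TABLE_ROWS is only an estimate for some engines -- use SELECT COUNT(*) on the partition's key range for an exact count):

```sql
-- Sketch: inspect per-partition row counts to decide when to split.
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME   = 'my_table'
ORDER BY PARTITION_ORDINAL_POSITION;
```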
I recommend keeping the "last" partition (the one with MAXVALUE) empty. That way you can REORGANIZE PARTITION to split it into two empty partitions with essentially zero overhead. And I recommend that instead of ADD PARTITION because you might slip up and put something in the last partition.
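The split itself looks something like this (a sketch; partition names and the boundary value are illustrative, and the operation is only near-free while pmax is empty):

```sql
-- Sketch: split the empty MAXVALUE partition into a new bounded
-- partition plus a fresh, still-empty MAXVALUE partition.
ALTER TABLE my_table
REORGANIZE PARTITION pmax INTO (
    PARTITION p69  VALUES LESS THAN (690000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
```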
It is unclear what will trigger 10M. Are there multiple rows for each store_id? And are there new rows coming in for each store? If so, partitioning on store_id won't help, since all partitions will be growing all the time.
OK, so store_id was just a lame example from the reference manual. Please provide SHOW CREATE TABLE so we can talk concretely, not hand-wave. There are simply too many ways to take this task.
What is the activity?
If you mostly hit the "recent" partition(s), then an uneven distribution may be warranted -- periodically add a new partition and combine an adjacent pair of old partitions. (I did this successfully in one system.)
If you will be purging "old" data, then clearly you need to use PARTITION BY RANGE(TO_DAYS(...)) and use DROP PARTITION plus REORGANIZE PARTITION.
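The purge side of that scheme is a single cheap statement (a sketch; the partition name is illustrative):

```sql
-- Sketch: dropping an old date-range partition deletes its rows
-- almost instantly, unlike a large DELETE.
ALTER TABLE my_table DROP PARTITION p2015_01;
```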
And there are lots of other scenarios. But I know of only 4 scenarios where Partitioning provides any performance benefit. See my blog.
Upvotes: 0