Reputation: 1323
We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have done a sharding by separating the table based on the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always check on the "serial_number" and will perform across many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily reference as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
A few challenges that we think we will face is:
Update. Below is the show create table result from our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
Upvotes: 1
Views: 1283
Reputation: 142453
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id)
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
device_sn, dt
, you get the clustering benefits to make the query with WHERE device_sn = .. AND dt BETWEEN ...
INDEX(id)
is to keep AUTO_INCREMENT
happy.INDEX(a,b)
, INDEX(a)
is redundant.(20)
is meaningless; id
will max out at about 4 billion.lng decimal(10,5)
-- Don't need 5 decimal places to left of point; only need 3 or 2. So: lat decimal(7,5),
lng decimal(8,5)`. This will save a total of 3 bytes per row.Upvotes: 0
Reputation: 142453
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY
. Assuming you currently have id AUTO_INCREMENT
because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).
Other Comments:
serial_number
first in the PRIMARY KEY
, all queries referring to a single serial_number are likely to benefit.serial_numbers
? No problem.DELETEs
are much more costly than DROP PARTITION
. That involves PARTITION BY RANGE(TO_DAYS(dt))
. If you are interested in that, my PK suggestion still stands. (And the query in question will run about the same speed with or without this partitioning.)DOUBLE
? FLOAT
has about 7 significant digits of precision and takes only 4 bytes.serial_number
fixed at 20 characters? If not, use VARCHAR
. Also, CHARACTER SET ascii
may be better than the default of utf8? Upvotes: 1