We have a large MySQL table (device_data) with the following columns: ID (int) dt (timestamp) serial_number (char(20)) data1 (double) data2 (double) ... // other columns The table receives around 10M rows every day. We have done a sharding by separating the table based on the date of the timestamp (device_data_YYYYMMDD) . However, we feel this is not effective because most of our queries (shown below) always check on the "serial_number" and will perform across many dates. SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07'; Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have: device_data_<serial_number> device_data_0012393746 device_data_7891238456 Hence, when we want to find data for a particular device, we can easily reference as: SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07'; This approach seems to be effective because: The application at all time will access the data based on the device first. We have checked that there is no query that access the data without specifying the device serial number first. The table for each device will be relatively small (9000 rows per day) A few challenges that we think we will face is: We have alot of devices. This means that the table device_data_ will be alot too. I have checked that MySQL does not provide limitation in the number of tables in the database. Will this impact on performance vs keeping them in one table? How will this impact on later on when we would like to scale MySQL (e.g. using master / slave, etc)? Are there other alternative / solution in resolving this? Update. Below is the show create table result from our existing table: CREATE TABLE `test_udp_new` ( `id` int(20) unsigned NOT NULL AUTO_INCREMENT, `dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, `device_sn` varchar(20) NOT NULL, `gps_date` datetime NOT NULL, `lat` decimal(10,5) DEFAULT NULL, `lng` decimal(10,5) DEFAULT NULL, PRIMARY KEY (`id`), KEY `device_sn_2` (`dt`,`device_sn`), KEY `dt` (`dt`), KEY `data` (`data`) USING BTREE, KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`), KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`) ) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC The most frequent queries being run: SELECT * FROM test_udp_new WHERE device_sn = 'xxx' AND dt >= 'xxx' AND dt <= 'xxx' ORDER BY dt DESC;

Reputation: 1323

MySQL Large Table Sharding to Smaller Table based on Unique ID

We have a large MySQL table (device_data) with the following columns:

ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns

The table receives around 10M rows every day.

We have done a sharding by separating the table based on the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always check on the "serial_number" and will perform across many dates.

SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';

Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:

device_data_<serial_number>
device_data_0012393746
device_data_7891238456

Hence, when we want to find data for a particular device, we can easily reference as:

SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';

This approach seems to be effective because:

The application at all time will access the data based on the device first.
We have checked that there is no query that access the data without specifying the device serial number first.
The table for each device will be relatively small (9000 rows per day)

A few challenges that we think we will face is:

We have alot of devices. This means that the table device_data_ will be alot too. I have checked that MySQL does not provide limitation in the number of tables in the database. Will this impact on performance vs keeping them in one table?
How will this impact on later on when we would like to scale MySQL (e.g. using master / slave, etc)?
Are there other alternative / solution in resolving this?

Update. Below is the show create table result from our existing table:

CREATE TABLE `test_udp_new` (
 `id` int(20) unsigned NOT NULL AUTO_INCREMENT,
 `dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `device_sn` varchar(20) NOT NULL,
 `gps_date` datetime NOT NULL,
 `lat` decimal(10,5) DEFAULT NULL,
 `lng` decimal(10,5) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `device_sn_2` (`dt`,`device_sn`),
 KEY `dt` (`dt`),
 KEY `data` (`data`) USING BTREE,
 KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
 KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC

The most frequent queries being run:

SELECT  *
    FROM  test_udp_new
    WHERE  device_sn = 'xxx'
      AND  dt >= 'xxx'
      AND  dt <= 'xxx'
    ORDER BY  dt DESC;

Upvotes: 1

Answers (2)

Rick James

Reputation: 142453

Addressing the query

 PRIMARY KEY (`id`),
 KEY `device_sn_2` (`dt`,`device_sn`),
 KEY `dt` (`dt`),
 KEY `data` (`data`) USING BTREE,
 KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
 KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)

-->

 PRIMARY KEY(`device_sn`,`dt`, id),
 INDEX(id)
 KEY `dt_sn` (`dt`,`device_sn`),
 KEY `data` (`data`) USING BTREE,

Notes:

By starting the PK with device_sn, dt, you get the clustering benefits to make the query with WHERE device_sn = .. AND dt BETWEEN ...
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because it is probably helped enough by the new PK.
lng decimal(10,5) -- Don't need 5 decimal places to left of point; only need 3 or 2. So: lat decimal(7,5),lng decimal(8,5)`. This will save a total of 3 bytes per row.

Upvotes: 0

Rick James

Reputation: 142453

The optimal way to handle that query is in a non-partitioned table with

INDEX(serial_number, dt)

Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",

PRIMARY KEY(serial_number, dt, id),  -- to optimize that query
INDEX(id)  -- to keep AUTO_INCREMENT happy

If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).

Other Comments:

There are very few use cases for which partitioning actually speed up processing.
Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&A on stackoverflow shouting not to do such.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data. This is because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). If you are interested in that, my PK suggestion still stands. (And the query in question will run about the same speed with or without this partitioning.)
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)

Upvotes: 1

MySQL Large Table Sharding to Smaller Table based on Unique ID

Answers (2)

Related Questions