Reputation: 307
I'm trying to implement a new database schema template to efficiently store/retrieve time series data for a "park".
A park has multiple devices (and subdevices), each with its signals. A park can have 2-5k devices, in some cases even more. Usually the time resolution of the signals is the same within a park, typically 5, 10, or 15 minutes, or 1 hour.
Since every park can have a different number of devices, each with a different number of data signals at different time resolutions, I have to create a database template that works well in all cases.
In our system there is an API that frequently reads the recent data (at most the last week), and only sporadically the historical data (when the end user requests it via an interface). The same is true for back-end processes that aggregate the new data, for example from 5-minute to 1-hour resolution, and process the historical data only if manually requested. The historical data will also be used for offline analysis of the park with dedicated software.
It is also important to be able to migrate the database/tables and to restore them quickly in case of a problem.
I have in mind two options:
Do you have some other idea in mind that would better fit the purpose? Could you highlight the pros and cons of the different approaches?
Here is some info about how the devices will be stored:
CREATE TABLE `Device` (
`id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`devicetype_id` smallint(5) unsigned NOT NULL,
`parent_id` smallint(5) unsigned DEFAULT NULL,
`name` varchar(50) NOT NULL,
`displayname` varchar(30) DEFAULT NULL,
`status` tinyint(4) NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `dev_par` (`name`,`parent_id`)
) ENGINE=InnoDB
And how the data will be stored:
CREATE TABLE `Data_raw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`device_id` smallint(5) unsigned NOT NULL,
`datetime` datetime NOT NULL COMMENT '[UTC] beginning of timestep',
`value` float NOT NULL,
`signal_id` smallint(5) NOT NULL,
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
Other Info:
Number of parks: currently 150, soon around 500. There will be 1 schema for each park.
A park has on avg 80Mln rows of data per year. History can reach 20 years at most, but on avg we have/will have around 5 years. Using Rick's estimate of 50b/row, that is ~4GB per year, so roughly 20GB for an average 5-year history (up to ~80GB at the 20-year maximum).
Considering that AWS Aurora MySQL can grow to 64TB, all the parks could fit in the database. In the worst case we could in the future split the parks of one client into different database shards, so that is not an issue.
For the database we will use Aurora MySQL on AWS, with a main instance that currently has 16GB of RAM and 4 vCPUs (this we can also increase), optimized for background processes/inserting data, plus a read replica with different HW specs optimized for operations on the API side.
All the history has to be stored and available (for less frequent operations like historical analysis, data download, etc.).
What other info could help you understand/determine the better choice? Thank you for your time.
Upvotes: 2
Views: 3650
Reputation: 142373
Partitioning is useful only if you intend to delete "old" data. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
You probably need indexes into Data_raw. And/or you probably need Summary tables.
If it is really UTC, consider using TIMESTAMP; that avoids timezone mangling. If you can't have two readings in the same second, change the PK to (device_id, datetime) and toss the useless id.
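For instance, the revised table might look something like this (a sketch, not the asker's DDL; signal_id is added to the PK on the assumption that a device reports several signals per timestep):

CREATE TABLE `Data_raw` (
  `device_id` smallint(5) unsigned NOT NULL,
  `signal_id` smallint(5) unsigned NOT NULL,
  `datetime` timestamp NOT NULL COMMENT '[UTC] beginning of timestep',
  `value` float NOT NULL,
  PRIMARY KEY (`device_id`, `datetime`, `signal_id`)  -- no surrogate id, no modified
) ENGINE=InnoDB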
""current data" table where all the signals data is stored for fast access" -- The above change to the PK leads to the "current data" for each device to be cluster together; no need for a separate table; not enough benefit from partitioning to depend on it for such.
"regularly move" -- not worth the programming effort.
"entire park are offline for some time" -- Fine. No, there won't be a "hole" of any significance.
modified seems useless and a waste of space.
Use InnoDB.
Give us some numbers. RAM size. Number of rows. Retention time. Number of parks. Etc. I have experience in this area; I would like to 'run the numbers' to see if there are other issues to point out.
More
PRIMARY KEY(device_id, datetime) -- if there could be dups, consider using INSERT ... ON DUPLICATE KEY UPDATE ... to insert or replace when a new row arrives. It's a single step.
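For example, something like this (hypothetical ids and values) inserts a new reading or overwrites a duplicate in one statement:

INSERT INTO Data_raw (device_id, signal_id, datetime, value)
VALUES (17, 3, '2020-01-01 00:05:00', 42.7)
ON DUPLICATE KEY UPDATE value = VALUES(value);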
Big tables can have indexes. Summary tables avoid the need for most indexes on the big table.
Summary tables have the limitation that you need to decide on a granularity of "time". In business applications, "day" is usually sufficient. For sensor monitoring, "hour" might be more appropriate. A goal is to fold an average of 10 or more rows of the raw data into one row of the Summary table.
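To make that concrete, an hourly summary table (a sketch with assumed column names) could keep the count and sum, so averages can be re-derived, and be fed by an INSERT ... SELECT over one hour of raw data:

CREATE TABLE `Data_hourly` (
  `device_id` smallint(5) unsigned NOT NULL,
  `signal_id` smallint(5) unsigned NOT NULL,
  `hour` datetime NOT NULL COMMENT '[UTC] top of hour',
  `ct` int unsigned NOT NULL,     -- raw rows folded into this row
  `sum_value` double NOT NULL,    -- keep SUM, not AVG, so averages can be re-aggregated
  `min_value` float NOT NULL,
  `max_value` float NOT NULL,
  PRIMARY KEY (`device_id`, `signal_id`, `hour`)
) ENGINE=InnoDB;

INSERT INTO Data_hourly
SELECT device_id, signal_id, '2020-01-01 00:00:00',
       COUNT(*), SUM(value), MIN(value), MAX(value)
FROM Data_raw
WHERE datetime >= '2020-01-01 00:00:00'
  AND datetime <  '2020-01-01 01:00:00'
GROUP BY device_id, signal_id;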
Having multiple tables as a way of partitioning data is usually a mistake. It complicates the code without necessarily providing any benefits. PARTITION BY RANGE(TO_DAYS(...)) is better (though still has some clumsiness). (Note: TO_DAYS() can be replaced by date computations that, for example, turn a TIMESTAMP into top-of-hour boundaries -- if you want resolution to the hour. Similarly for 10-minute, etc.)
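A sketch of what that could look like, with hypothetical partition names and dates (note that MySQL requires the partitioning column to appear in every unique key, and that a TIMESTAMP column would need UNIX_TIMESTAMP() ranges instead of TO_DAYS()):

CREATE TABLE `Data_raw` (
  `device_id` smallint(5) unsigned NOT NULL,
  `signal_id` smallint(5) unsigned NOT NULL,
  `datetime` datetime NOT NULL,
  `value` float NOT NULL,
  PRIMARY KEY (`device_id`, `datetime`, `signal_id`)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(`datetime`)) (
  PARTITION p20200101 VALUES LESS THAN (TO_DAYS('2020-01-02')),
  PARTITION p20200102 VALUES LESS THAN (TO_DAYS('2020-01-03')),
  PARTITION p20200103 VALUES LESS THAN (TO_DAYS('2020-01-04')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);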
Sizing
Is this a correct analysis of the data volume? 500 parks * 80M rows/year is roughly 40 billion rows per year; at 50 bytes/row that is about 2TB per year, or ~10TB for an average 5-year history. Something needs to give.
40 billion rows/year also averages out to over 1000 INSERTs/sec. That (if on a single machine) is possible, but we need to discuss how to do it. (I use the Rule of Thumb: "Out of the box, MySQL can handle 100 inserts/sec; more than that needs some discussion".)
Toss id (which, at INT, was too small, and not useful), and modified.
PARTITION by day if keeping only one week online; by year if keeping 10 (or more) years. Partitioning is BY RANGE, but it has a drawback when rearranging the sizes: combining, say, 4 weeks to make a month ties up the table while the REORGANIZE is being done.
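For instance (hypothetical partition names, matching the sketch above), purging old data and merging partitions look like this:

-- Dropping an old partition is almost free:
ALTER TABLE Data_raw DROP PARTITION p20200101;

-- Merging partitions rewrites them and blocks access meanwhile:
ALTER TABLE Data_raw
  REORGANIZE PARTITION p20200102, p20200103 INTO (
    PARTITION p2020w01 VALUES LESS THAN (TO_DAYS('2020-01-04'))
  );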
Upvotes: 3
Reputation: 4539
Have you thought about using a time series database for your data?
Your proposed schema is of the universal type (the metric name is stored in the signal_id column), so you should expect 30-70 bytes per time:value pair, with a corresponding I/O load when reading and writing data. Compare that to less than 2 bytes in a modern time series database such as Axibase TSD (my affiliation); see its published compression tests. Feel free to post a small subset of data, as others suggested, to get some more specific feedback.
Upvotes: 0