Reputation: 997
I'm building an IoT system for home appliances.
My data table was created as follows:
mysql> SHOW CREATE TABLE DataM1\G
*************************** 1. row ***************************
Table: DataM1
Create Table: CREATE TABLE `DataM1` (
`sensor_type` text,
`sensor_name` text,
`timestamp` datetime DEFAULT NULL,
`data_type` text,
`massimo` float DEFAULT NULL,
`minimo` float DEFAULT NULL,
KEY `timestamp_id` (`timestamp`) USING BTREE,
KEY `super_index_id` (`timestamp`,`sensor_name`(11),`data_type`(11)) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and the query is
SELECT
sensor_type, sensor_name, timestamp, data_type,
MAX(massimo) as massimo, MIN(minimo) as minimo
FROM DataM1
WHERE timestamp >= NOW() - INTERVAL 1 HOUR
GROUP BY timestamp, sensor_type, sensor_name, data_type;
Now, the problem is that once the table reaches 4 million rows (a few days of data), the query takes 50+ seconds.
Edit: EXPLAIN result is as following:
id: 1
select_type: SIMPLE
table: DataM1
partitions: p0,p1,p2,p3,p4,p5,p6
type: range
possible_keys: timestamp_id,super_index_id
key: timestamp_id
key_len: 6
ref: NULL
rows: 1
filtered: 100.00
Extra: Using index condition; Using temporary; Using filesort
Edit: a sample row of the result is:
*************************** 418037. row ***************************
sensor_type: SEN
sensor_name: SEN_N2
timestamp: 2016-10-16 17:28:48
data_type: flow_rate
massimo: 17533.8
minimo: 17533.5
Edit: I have normalized the timestamp, sensor_type, sensor_name and data_type values and created a _view to make the data easier to consume:
CREATE VIEW `_view` AS (
  SELECT
    (SELECT `vtmp`.`timestamp`   FROM `timestamp`   `vtmp` WHERE `vtmp`.`no` = `pm`.`timestamp`)   AS `timestamp`,
    (SELECT `vtmp`.`sensor_type` FROM `sensor_type` `vtmp` WHERE `vtmp`.`no` = `pm`.`sensor_type`) AS `sensor_type`,
    (SELECT `vtmp`.`sensor_name` FROM `sensor_name` `vtmp` WHERE `vtmp`.`no` = `pm`.`sensor_name`) AS `sensor_name`,
    (SELECT `vtmp`.`data_type`   FROM `data_type`   `vtmp` WHERE `vtmp`.`no` = `pm`.`data_type`)   AS `data_type`,
    `pm`.`massimo` AS `massimo`,
    `pm`.`minimo`  AS `minimo`
  FROM `datam1` `pm`
  ORDER BY `pm`.`timestamp` DESC
);
Is there a way to speed this up with indexing, sharding and/or partitioning? Or is it better to rethink the table and split the information into separate tables? If so, could anyone propose a best practice for such a situation?
Upvotes: 3
Views: 1264
Reputation: 142298
This answer discusses how to build a Summary Table.
CREATE TABLE Summary (
-- The primary key:
hr DATETIME NOT NULL COMMENT "Start of hour",
sensor_type ...,
sensor_name ...,
-- The aggregates being collected:
num_readings SMALLINT UNSIGNED NOT NULL,
sum_reading FLOAT NOT NULL, -- (maybe)
min_reading FLOAT NOT NULL,
max_reading FLOAT NOT NULL,
PRIMARY KEY(hr, sensor_type, sensor_name),
INDEX(sensor_name, hr) -- Maybe you want to look up by sensor?
) ENGINE=InnoDB;
Every hour, populate it with something like
INSERT INTO Summary
(hr, sensor_type, sensor_name, num_readings,
sum_reading, min_reading, max_reading)
SELECT
FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1)), -- start of prev hour
sensor_type,
sensor_name,
COUNT(*), -- how many readings were taken in the hour.
SUM(??), -- maybe this is not practical, since you seem to have pairs of readings
MIN(minimo),
MAX(massimo)
FROM DataM1
WHERE `timestamp` >= FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1))
AND `timestamp` < FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600)))
GROUP BY sensor_type, sensor_name;
This assumes you are taking readings every, say, minute. If you are only taking readings once an hour, it would make more sense to summarize to a coarser period, such as a day.
More discussion: Summary Tables.
To be more robust, the summarization INSERT-SELECT may need to be more complex -- what if you miss an hour? (And other things can go wrong.)
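One way to handle that, as a minimal sketch (the @hr variable and the ON DUPLICATE KEY UPDATE handling are my additions, not requirements): summarize one explicit hour at a time, so a missed hour can simply be re-run later without creating duplicate rows, thanks to the PRIMARY KEY.
SET @hr := FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1));   -- or any hour you missed

INSERT INTO Summary
        (hr, sensor_type, sensor_name, num_readings,
         sum_reading, min_reading, max_reading)
    SELECT @hr, sensor_type, sensor_name,
           COUNT(*),
           SUM(massimo),        -- placeholder; see the SUM(??) note above
           MIN(minimo),
           MAX(massimo)
    FROM DataM1
    WHERE `timestamp` >= @hr
      AND `timestamp` <  @hr + INTERVAL 1 HOUR
    GROUP BY sensor_type, sensor_name
ON DUPLICATE KEY UPDATE
    num_readings = VALUES(num_readings),
    sum_reading  = VALUES(sum_reading),
    min_reading  = VALUES(min_reading),
    max_reading  = VALUES(max_reading);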
Caveat: This summary table will be a lot faster than reading from the "Fact" table, but it can only display ranges of time based on whole hours. If you need "the last 60 minutes", you will need to go to the Fact table.
Another note: You should normalize bulky, repetitious things like sensor_name in the Fact table, but you could (maybe should) denormalize when building the Summary table. (I left out those steps in this example.)
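A sketch of what that could look like (the SensorNames table and the sensor_id column are illustrative, not part of the schema above): keep a small lookup table of names, store only its id in the Fact table, and JOIN the name back in while building the Summary.
CREATE TABLE SensorNames (
    sensor_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    sensor_name VARCHAR(50) NOT NULL,
    PRIMARY KEY (sensor_id),
    UNIQUE KEY (sensor_name)
) ENGINE=InnoDB;

-- The Fact table then carries the 2-byte sensor_id instead of the repeated name.
-- Denormalize while summarizing:
INSERT INTO Summary (hr, sensor_type, sensor_name, num_readings,
                     sum_reading, min_reading, max_reading)
    SELECT @hr, f.sensor_type, sn.sensor_name,
           COUNT(*),
           SUM(f.massimo),      -- placeholder aggregate, as above
           MIN(f.minimo),
           MAX(f.massimo)
    FROM DataM1 AS f
    JOIN SensorNames AS sn ON sn.sensor_id = f.sensor_id
    WHERE f.`timestamp` >= @hr
      AND f.`timestamp` <  @hr + INTERVAL 1 HOUR
    GROUP BY f.sensor_type, sn.sensor_name;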
For fetching the data for yesterday:
SELECT sensor_type, sensor_name,
       MAX(max_reading) AS massimo,
       MIN(min_reading) AS minimo
FROM Summary
WHERE hr >= CURRENT_DATE() - INTERVAL 1 DAY
  AND hr < CURRENT_DATE()
GROUP BY sensor_type, sensor_name;
For all of June:
WHERE hr >= '2016-06-01'
  AND hr < '2016-06-01' + INTERVAL 1 MONTH
Note: The simple way to get an average is to average the averages. But the mathematically correct way is to sum the sums and divide by the sum of the counts. Hence my inclusion of sum_reading and num_readings. On the other hand, when averaging things like weather readings, it is common to get the average for each day, then average over the days. I'll leave it to you to decide what is 'right'.
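For example, a properly weighted daily average computed from the Summary table would look something like this (a sketch; it assumes sum_reading is actually being populated):
SELECT sensor_type, sensor_name,
       SUM(sum_reading) / SUM(num_readings) AS avg_reading
FROM Summary
WHERE hr >= CURRENT_DATE() - INTERVAL 1 DAY
  AND hr <  CURRENT_DATE()
GROUP BY sensor_type, sensor_name;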
Upvotes: 1
Reputation: 142298
- Don't use index "prefixing" such as sensor_name(11); it rarely helps and sometimes hurts.
- Don't use TEXT; instead VARCHAR(...) with some realistic limit. ENUM is a reasonable alternative.
- Do have a PRIMARY KEY. If no column (or set of columns) is Unique, then use an AUTO_INCREMENT.
- Don't GROUP BY with the exact timestamp. Maybe truncate to the hour? For example, CONCAT(LEFT(timestamp, 13), ':xx') would yield something like 2016-10-16 20:xx.
- You have no LIMIT, nor ORDER BY. Will that continue to be the case?

Those suggestions will help in various ways. Once you have fixed most of them, we can discuss how to use Summary Tables to get a 10x speedup.
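Putting several of those points together, a sketch of what the revised table and query might look like (the VARCHAR sizes and ENUM values are guesses; adjust them to your real data):
CREATE TABLE DataM1 (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    sensor_type VARCHAR(20) NOT NULL,
    sensor_name VARCHAR(50) NOT NULL,
    `timestamp` DATETIME    NOT NULL,
    data_type   ENUM('flow_rate', 'temperature') NOT NULL,   -- list your real data types
    massimo     FLOAT DEFAULT NULL,
    minimo      FLOAT DEFAULT NULL,
    PRIMARY KEY (id),
    KEY super_index_id (`timestamp`, sensor_name, data_type)  -- no prefixing
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- The query, grouped by the hour instead of the exact timestamp:
SELECT sensor_type, sensor_name,
       CONCAT(LEFT(`timestamp`, 13), ':xx') AS hour_bucket,
       data_type,
       MAX(massimo) AS massimo, MIN(minimo) AS minimo
FROM DataM1
WHERE `timestamp` >= NOW() - INTERVAL 1 HOUR
GROUP BY hour_bucket, sensor_type, sensor_name, data_type;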
Upvotes: 2
Reputation: 151
I think that in such use cases, when you have so much data, the best solution may be to use a NoSQL database and perform some aggregation before storing the data. You could have a look at Google BigQuery and Cloud Dataflow.
However, to answer your question: I would pre-calculate the aggregation at the minimum granularity required by the system (for example, every 10 minutes), and then you would be able to run the query against a much smaller amount of data.
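As a rough sketch of that idea in plain MySQL (the Summary10m table and its names are only an example), aggregate each completed 10-minute window into a small table and query that instead:
CREATE TABLE Summary10m (
    bucket_start DATETIME    NOT NULL,   -- start of the 10-minute window
    sensor_type  VARCHAR(20) NOT NULL,
    sensor_name  VARCHAR(50) NOT NULL,
    data_type    VARCHAR(20) NOT NULL,
    massimo      FLOAT NOT NULL,
    minimo       FLOAT NOT NULL,
    PRIMARY KEY (bucket_start, sensor_type, sensor_name, data_type)
) ENGINE=InnoDB;

SET @bucket := FROM_UNIXTIME(600 * (FLOOR(UNIX_TIMESTAMP() / 600) - 1));  -- previous full window

INSERT INTO Summary10m (bucket_start, sensor_type, sensor_name, data_type, massimo, minimo)
    SELECT @bucket, sensor_type, sensor_name, data_type,
           MAX(massimo), MIN(minimo)
    FROM DataM1
    WHERE `timestamp` >= @bucket
      AND `timestamp` <  @bucket + INTERVAL 10 MINUTE
    GROUP BY sensor_type, sensor_name, data_type;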
Upvotes: -1