Reputation: 997
I'm building an IoT system for home appliances.
My data table was created as follows:
mysql> SHOW CREATE TABLE DataM1\G
*************************** 1. row ***************************
Table: DataM1
Create Table: CREATE TABLE `DataM1` (
`sensor_type` text,
`sensor_name` text,
`timestamp` datetime DEFAULT NULL,
`data_type` text,
`massimo` float DEFAULT NULL,
`minimo` float DEFAULT NULL,
KEY `timestamp_id` (`timestamp`) USING BTREE,
KEY `super_index_id` (`timestamp`,`sensor_name`(11),`data_type`(11)) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and the query is
SELECT
sensor_type, sensor_name, timestamp, data_type,
MAX(massimo) as massimo, MIN(minimo) as minimo
FROM DataM1
WHERE timestamp >= NOW() - INTERVAL 1 HOUR
GROUP BY timestamp, sensor_type, sensor_name, data_type;
Now, the problem is that once the table reaches 4 million rows (a few days of data), the query takes 50+ seconds.
Edit: EXPLAIN result is as following:
id: 1
select_type: SIMPLE
table: DataM1
partitions: p0,p1,p2,p3,p4,p5,p6
type: range
possible_keys: timestamp_id,super_index_id
key: timestamp_id
key_len: 6
ref: NULL
rows: 1
filtered: 100.00
Extra: Using index condition; Using temporary; Using filesort
Edit: a sample row of the result is:
*************************** 418037. row ***************************
sensor_type: SEN
sensor_name: SEN_N2
timestamp: 2016-10-16 17:28:48
data_type: flow_rate
massimo: 17533.8
minimo: 17533.5
Edit: I have normalized the timestamp, sensor_type, sensor_name and data_type values and created a _view to make the data easier to consume:
CREATE VIEW `_view` AS (
  SELECT
    (SELECT `vtmp`.`timestamp`   FROM `timestamp`   `vtmp` WHERE `vtmp`.`no` = `pm`.`timestamp`)   AS `timestamp`,
    (SELECT `vtmp`.`sensor_type` FROM `sensor_type` `vtmp` WHERE `vtmp`.`no` = `pm`.`sensor_type`) AS `sensor_type`,
    (SELECT `vtmp`.`sensor_name` FROM `sensor_name` `vtmp` WHERE `vtmp`.`no` = `pm`.`sensor_name`) AS `sensor_name`,
    (SELECT `vtmp`.`data_type`   FROM `data_type`   `vtmp` WHERE `vtmp`.`no` = `pm`.`data_type`)   AS `data_type`,
    `pm`.`massimo` AS `massimo`,
    `pm`.`minimo`  AS `minimo`
  FROM `datam1` `pm`
  ORDER BY `pm`.`timestamp` DESC
);
Is there a way to speed this up with indexing, sharding and/or partitioning? Or is it better to rethink the table and split the information into separate tables? If so, could anyone propose a best practice for such a situation?
Upvotes: 3
Views: 1264
Reputation: 142298
This answer discusses how to build a Summary Table.
CREATE TABLE Summary (
-- The primary key:
hr DATETIME NOT NULL COMMENT "Start of hour",
sensor_type ...,
sensor_name ...,
-- The aggregates being collected:
num_readings SMALLINT UNSIGNED NOT NULL,
sum_reading FLOAT NOT NULL, -- (maybe)
min_reading FLOAT NOT NULL,
max_reading FLOAT NOT NULL,
PRIMARY KEY(hr, sensor_type, sensor_name),
INDEX(sensor_name, hr) -- Maybe you want to look up by sensor?
) ENGINE=InnoDB;
Every hour, populate it with something like
INSERT INTO Summary
(hr, sensor_type, sensor_name, num_readings,
sum_reading, min_reading, max_reading)
SELECT
FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1)), -- start of prev hour
sensor_type,
sensor_name,
COUNT(*), -- how many readings were taken in the hour.
SUM(??), -- maybe this is not practical, since you seem to have pairs of readings
MIN(minimo),
MAX(massimo)
FROM DataM1
WHERE `timestamp` >= FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1))
AND `timestamp` < FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600)))
GROUP BY sensor_type, sensor_name;
This assumes you are taking readings every, say, minute. If you are only taking readings once an hour, it would make more sense to summarize to a coarser period, such as a day.
More discussion: Summary Tables.
To be more robust, the summarization INSERT-SELECT may need to be more complex -- what if you miss an hour? (And other things can go wrong.)
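One way to handle that, as a minimal sketch (the @hr variable and the ON DUPLICATE KEY UPDATE handling are my additions, not requirements): summarize one explicit hour at a time, so a missed hour can simply be re-run later without creating duplicate rows, thanks to the PRIMARY KEY.
SET @hr := FROM_UNIXTIME(3600 * (FLOOR(UNIX_TIMESTAMP() / 3600) - 1));   -- or any hour you missed

INSERT INTO Summary
        (hr, sensor_type, sensor_name, num_readings,
         sum_reading, min_reading, max_reading)
    SELECT @hr, sensor_type, sensor_name,
           COUNT(*),
           SUM(massimo),        -- placeholder; see the SUM(??) note above
           MIN(minimo),
           MAX(massimo)
    FROM DataM1
    WHERE `timestamp` >= @hr
      AND `timestamp` <  @hr + INTERVAL 1 HOUR
    GROUP BY sensor_type, sensor_name
ON DUPLICATE KEY UPDATE
    num_readings = VALUES(num_readings),
    sum_reading  = VALUES(sum_reading),
    min_reading  = VALUES(min_reading),
    max_reading  = VALUES(max_reading);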
Caveat: This summary table will be a lot faster than reading from the "Fact" table, but it can only display ranges of time based on whole hours. If you need "the last 60 minutes", you will need to go to the Fact table.
Another note: You should normalize bulky, repetitious things like sensor_name in the Fact table, but you could (maybe should) denormalize when building the Summary table. (I left out those steps in this example.)
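A sketch of what that could look like (the SensorNames table and the sensor_id column are illustrative, not part of the schema above): keep a small lookup table of names, store only its id in the Fact table, and JOIN the name back in while building the Summary.
CREATE TABLE SensorNames (
    sensor_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    sensor_name VARCHAR(50) NOT NULL,
    PRIMARY KEY (sensor_id),
    UNIQUE KEY (sensor_name)
) ENGINE=InnoDB;

-- The Fact table then carries the 2-byte sensor_id instead of the repeated name.
-- Denormalize while summarizing:
INSERT INTO Summary (hr, sensor_type, sensor_name, num_readings,
                     sum_reading, min_reading, max_reading)
    SELECT @hr, f.sensor_type, sn.sensor_name,
           COUNT(*),
           SUM(f.massimo),      -- placeholder aggregate, as above
           MIN(f.minimo),
           MAX(f.massimo)
    FROM DataM1 AS f
    JOIN SensorNames AS sn ON sn.sensor_id = f.sensor_id
    WHERE f.`timestamp` >= @hr
      AND f.`timestamp` <  @hr + INTERVAL 1 HOUR
    GROUP BY f.sensor_type, sn.sensor_name;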
For fetching the data for yesterday:
SELECT sensor_type, sensor_name,
       MAX(max_reading) AS massimo,
       MIN(min_reading) AS minimo
FROM Summary
WHERE hr >= CURRENT_DATE() - INTERVAL 1 DAY
  AND hr < CURRENT_DATE()
GROUP BY sensor_type, sensor_name;
For all of June:
WHERE hr >= '2016-06-01'
  AND hr < '2016-06-01' + INTERVAL 1 MONTH
Note: The simple way to get an average is to average the averages. But the mathematically correct way is to sum the sums and divide by the sum of the counts. Hence my inclusion of sum_reading and num_readings. On the other hand, when averaging things like weather readings, it is common to get the average for each day, then average over the days. I'll leave it to you to decide what is 'right'.
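For example, a properly weighted daily average computed from the Summary table would look something like this (a sketch; it assumes sum_reading is actually being populated):
SELECT sensor_type, sensor_name,
       SUM(sum_reading) / SUM(num_readings) AS avg_reading
FROM Summary
WHERE hr >= CURRENT_DATE() - INTERVAL 1 DAY
  AND hr <  CURRENT_DATE()
GROUP BY sensor_type, sensor_name;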
Upvotes: 1
Reputation: 142298
- Don't use index "prefixing" such as sensor_name(11); it rarely helps and sometimes hurts.
- Don't use TEXT; instead VARCHAR(...) with some realistic limit. ENUM is a reasonable alternative.
- Do have a PRIMARY KEY. If no column (or set of columns) is Unique, then use an AUTO_INCREMENT.
- Don't GROUP BY with the exact timestamp. Maybe truncate to the hour? For example, CONCAT(LEFT(timestamp, 13), ':xx') would yield something like 2016-10-16 20:xx.
- You have no LIMIT, nor ORDER BY. Will that continue to be the case?

Those suggestions will help in various ways. Once you have fixed most of them, we can discuss how to use Summary Tables to get a 10x speedup.
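Putting several of those points together, a sketch of what the revised table and query might look like (the VARCHAR sizes and ENUM values are guesses; adjust them to your real data):
CREATE TABLE DataM1 (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    sensor_type VARCHAR(20) NOT NULL,
    sensor_name VARCHAR(50) NOT NULL,
    `timestamp` DATETIME    NOT NULL,
    data_type   ENUM('flow_rate', 'temperature') NOT NULL,   -- list your real data types
    massimo     FLOAT DEFAULT NULL,
    minimo      FLOAT DEFAULT NULL,
    PRIMARY KEY (id),
    KEY super_index_id (`timestamp`, sensor_name, data_type)  -- no prefixing
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- The query, grouped by the hour instead of the exact timestamp:
SELECT sensor_type, sensor_name,
       CONCAT(LEFT(`timestamp`, 13), ':xx') AS hour_bucket,
       data_type,
       MAX(massimo) AS massimo, MIN(minimo) AS minimo
FROM DataM1
WHERE `timestamp` >= NOW() - INTERVAL 1 HOUR
GROUP BY hour_bucket, sensor_type, sensor_name, data_type;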
Upvotes: 2
Reputation: 151
I think that in such use cases, when you have so much data, the best solution may be to use a NoSQL database and perform some aggregation before storing the data. You could have a look at Google BigQuery and Cloud Dataflow.
However, to answer your question: I would pre-calculate the aggregation at the minimum granularity required by the system (for example, every 10 minutes), and then you would be able to run the query against a much smaller amount of data.
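As a rough sketch of that idea in plain MySQL (the Summary10m table and its names are only an example), aggregate each completed 10-minute window into a small table and query that instead:
CREATE TABLE Summary10m (
    bucket_start DATETIME    NOT NULL,   -- start of the 10-minute window
    sensor_type  VARCHAR(20) NOT NULL,
    sensor_name  VARCHAR(50) NOT NULL,
    data_type    VARCHAR(20) NOT NULL,
    massimo      FLOAT NOT NULL,
    minimo       FLOAT NOT NULL,
    PRIMARY KEY (bucket_start, sensor_type, sensor_name, data_type)
) ENGINE=InnoDB;

SET @bucket := FROM_UNIXTIME(600 * (FLOOR(UNIX_TIMESTAMP() / 600) - 1));  -- previous full window

INSERT INTO Summary10m (bucket_start, sensor_type, sensor_name, data_type, massimo, minimo)
    SELECT @bucket, sensor_type, sensor_name, data_type,
           MAX(massimo), MIN(minimo)
    FROM DataM1
    WHERE `timestamp` >= @bucket
      AND `timestamp` <  @bucket + INTERVAL 10 MINUTE
    GROUP BY sensor_type, sensor_name, data_type;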
Upvotes: -1