Reputation: 291
I have a sales data table into which about 1,329,415 rows are inserted daily. I have to generate reports from this table every day in different formats, but queries against it are very slow. Here is the output of my SHOW CREATE TABLE command.
CREATE TABLE `query_manager_table` (
`mtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`region_id` int(2) NOT NULL,
`rtslug` varchar(10) DEFAULT NULL,
`dsid` int(3) NOT NULL,
`dpid` int(3) NOT NULL,
`route_number` int(4) NOT NULL,
`route_id` int(11) NOT NULL,
`rtlid` int(11) NOT NULL,
`retailer_code` varchar(16) DEFAULT NULL,
`platform_code` varchar(16) DEFAULT NULL,
`prid` int(4) NOT NULL,
`skid` int(4) NOT NULL,
`group` int(4) NOT NULL,
`family` int(4) NOT NULL,
`volume` float DEFAULT NULL,
`value` float(7,2) DEFAULT NULL,
`date` date NOT NULL DEFAULT '0000-00-00',
`outlets` int(4) NOT NULL,
`visited` int(4) NOT NULL,
`channel` int(3) DEFAULT NULL,
`subchannel` int(3) DEFAULT NULL,
`tpg` int(4) DEFAULT NULL,
`ioq` int(10) DEFAULT NULL,
`sales_time` int(11) DEFAULT NULL,
PRIMARY KEY (`dpid`,`route_id`,`rtlid`,`prid`,`skid`,`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY LIST (YEAR(date) * 100 + QUARTER(date))
(PARTITION y2017q1 VALUES IN (201701) ENGINE = InnoDB,
PARTITION y2017q2 VALUES IN (201702) ENGINE = InnoDB,
PARTITION y2017q3 VALUES IN (201703) ENGINE = InnoDB,
PARTITION y2017q4 VALUES IN (201704) ENGINE = InnoDB,
PARTITION y2018q1 VALUES IN (201801) ENGINE = InnoDB,
PARTITION y2018q2 VALUES IN (201802) ENGINE = InnoDB,
PARTITION y2018q3 VALUES IN (201803) ENGINE = InnoDB,
PARTITION y2018q4 VALUES IN (201804) ENGINE = InnoDB,
PARTITION y2019q1 VALUES IN (201901) ENGINE = InnoDB,
PARTITION y2019q2 VALUES IN (201902) ENGINE = InnoDB,
PARTITION y2019q3 VALUES IN (201903) ENGINE = InnoDB,
PARTITION y2019q4 VALUES IN (201904) ENGINE = InnoDB) */
Now I just want the per-retailer sales from 1st September to 9th September, using the following query -
SELECT
query_manager_table.dpid,
query_manager_table.route_id,
query_manager_table.rtlid,
query_manager_table.prid,
SUM(query_manager_table.`volume`) AS sales,
1 AS memos
FROM
query_manager_table
WHERE
query_manager_table.date BETWEEN '2018-09-01'
AND '2018-09-08'
GROUP BY
query_manager_table.dpid,
query_manager_table.rtlid,
query_manager_table.date
But it takes about 500-700 seconds. Since both fields are part of the primary key, I added dpid IN (1,2,.....)
AND prid IN (1,2,....)
to the WHERE clause. The output then comes after about 300 seconds. What am I doing wrong?
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | ALL | PRIMARY | NULL | NULL | NULL | 129065467 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
When I add all the dpid and prid values to the WHERE condition, the EXPLAIN looks like
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | range | PRIMARY | PRIMARY | 4 | NULL | 128002 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
Is there any way to optimize the table or the query? If I run EXPLAIN PARTITIONS SELECT... for the first query, I get -
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | y2017q1,y2017q2,y2017q3,y2017q4,y2018q1,y2018q2,y2018q3,y2018q4,y2019q1,y2019q2,y2019q3,y2019q4 | ALL | PRIMARY | NULL | NULL | NULL | 127129410 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
For the second one I get -
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | y2017q1,y2017q2,y2017q3,y2017q4,y2018q1,y2018q2,y2018q3,y2018q4,y2019q1,y2019q2,y2019q3,y2019q4 | range | PRIMARY | PRIMARY | 4 | NULL | 153424 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
Upvotes: 1
Views: 215
Reputation: 142356
`INDEX`es are used for efficiency in `SELECT`s.
The one `PRIMARY KEY` (in MySQL) is, by definition, a unique `INDEX`. It should have a minimal set of columns that uniquely identify a row.
Any unique index (including the PK) is also a "uniqueness constraint" -- this prevents inserting multiple rows with the same set of values.
Indexes are used "from the left". That is, with `INDEX(a,b)`, if `a` is not useful, it won't get to `b`.
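As a hypothetical illustration of that leftmost-prefix rule (the table and index names here are made up, not from your schema):

```sql
-- Toy table to illustrate the leftmost-prefix rule
CREATE TABLE t (a INT, b INT, c INT, INDEX ab (a, b));

-- Can use index `ab`: the filter constrains the leftmost column `a`
SELECT c FROM t WHERE a = 1 AND b = 2;

-- Cannot use index `ab` effectively: `a` is unconstrained,
-- so the optimizer never reaches `b`
SELECT c FROM t WHERE b = 2;
```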
`PARTITION BY LIST` is virtually useless. It rarely, if ever, improves performance. You have shown us a couple of queries; let's see more of the typical queries so we can help you with indexes and partitioning.
`WHERE query_manager_table.date BETWEEN '2018-09-01' AND '2018-09-08'` begs for `INDEX(date)`. In a composite index, the columns after a 'range' won't be reached. That is, in `INDEX(date, x, y)`, testing `date` for a range (such as the 8 days in the `WHERE`) won't let it make use of `x` or `y`. On the other hand, `WHERE date = '2018-09-01' AND x=1` will make use of more of the index.
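One possible index for the sample query, as a sketch (the index name is arbitrary, and whether the extra columns pay off should be verified with `EXPLAIN` against real data):

```sql
-- Sketch: put the range column (`date`) first so the date filter can
-- prune rows; the trailing columns can't be used for further filtering
-- after the range, but may make the index "covering" for the query.
ALTER TABLE query_manager_table
  ADD INDEX idx_date (`date`, dpid, rtlid, route_id, prid, volume);
```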
`float(7,2)` -- don't use the `(m,n)` option on `FLOAT` or `DOUBLE`. Instead, switch to `DECIMAL`.
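For the `value` column, that change might look like this (a sketch; an `ALTER` on a ~129M-row table rebuilds it, so test on a copy first):

```sql
-- Replace the approximate FLOAT(7,2) with an exact DECIMAL type
ALTER TABLE query_manager_table
  MODIFY `value` DECIMAL(7,2) DEFAULT NULL;
```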
`INT` is always 4 bytes. See `TINYINT` (1 byte), `SMALLINT` (2 bytes), etc. This, alone, may cut the table size in half.
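A sketch of what shrinking some columns could look like. The ranges below are assumptions based on the `int(2)`/`int(3)`/`int(4)` display widths in your schema; check the actual maximum values (e.g. with `SELECT MAX(region_id) ...`) before running anything like this:

```sql
-- Assumed ranges; verify against real data first
ALTER TABLE query_manager_table
  MODIFY region_id TINYINT  UNSIGNED NOT NULL,  -- int(2) suggests < 256
  MODIFY dsid      SMALLINT UNSIGNED NOT NULL,  -- int(3)
  MODIFY dpid      SMALLINT UNSIGNED NOT NULL,  -- int(3)
  MODIFY prid      SMALLINT UNSIGNED NOT NULL,  -- int(4)
  MODIFY skid      SMALLINT UNSIGNED NOT NULL;  -- int(4)
```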
To explain this:

PRIMARY KEY (`dpid`,`route_id`, ...
WHERE ... AND dpid IN (...) AND ...

manages to use the first (remember: 'leftmost') column for the pseudo-range `IN`, but can't use anything else in the PK since `route_id` is next. This explains why the second `EXPLAIN` has a smaller "rows". Also, note the "4" in "key_len" -- that's the number of bytes in `dpid`.
After you have made some of those changes, come back so we can discuss using Summary Tables to speed things up. However, if existing rows get modified, that optimization becomes more complex.
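To give the flavor of a summary table (the names here are assumptions, not an existing table): pre-aggregate once per day, then point the reports at the small table instead of the raw one.

```sql
-- Sketch: daily pre-aggregated sales per (dpid, rtlid)
CREATE TABLE sales_daily_summary (
  `date`     DATE NOT NULL,
  dpid       INT NOT NULL,
  rtlid      INT NOT NULL,
  volume_sum DOUBLE NOT NULL,
  PRIMARY KEY (`date`, dpid, rtlid)
);

-- Run once after each day's load
INSERT INTO sales_daily_summary
SELECT `date`, dpid, rtlid, SUM(volume)
FROM query_manager_table
WHERE `date` = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY `date`, dpid, rtlid;
```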
How much RAM do you have? What is the value of `innodb_buffer_pool_size`?
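You can check it like this (a common rule of thumb, not a hard rule, is around 70% of RAM on a dedicated database server):

```sql
-- Current buffer pool size, in bytes
SELECT @@innodb_buffer_pool_size;
-- or equivalently
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```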
Don't use GUIDs unless you must; their randomness slows down actions on large tables.
Upvotes: 1
Reputation: 6165
I would not combine actual data fields to make a primary key. I would have a single field and use an auto-incrementing integer, or perhaps a GUID, for the value. Having to go through six fields to identify a unique record takes more time than going through one, and you run the risk of duplicate keys if a user is entering key data.
If you have business reasons to make those six fields unique when taken together, you should also work up a routine to identify whether or not an inserted record duplicates an existing one with respect to these fields. If you are batch inserting, you'll want to do this after inserting the records rather than checking each one as you insert it. You'll also want to index these six fields, to speed up your query for duplicates.
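Such a post-load duplicate check could be sketched like this (it assumes the six-column primary key has been replaced by a surrogate id, so duplicates on these columns are now possible):

```sql
-- Find rows that share the same six-column business key
SELECT dpid, route_id, rtlid, prid, skid, `date`, COUNT(*) AS copies
FROM query_manager_table
GROUP BY dpid, route_id, rtlid, prid, skid, `date`
HAVING COUNT(*) > 1;
```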
As for your `SELECT` query, you'll probably want to index the fields in your `WHERE` clause. In any case, you'll want to read up on execution plans and experiment with different indexes and key structures (probably easier to do on a subset of your data). Google "mysql execution plan" for lots of information.
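For instance, you can prepend `EXPLAIN` (or `EXPLAIN PARTITIONS` on MySQL 5.x) to a candidate query to see which index, if any, it would use before running it for real:

```sql
EXPLAIN
SELECT dpid, rtlid, SUM(volume)
FROM query_manager_table
WHERE `date` BETWEEN '2018-09-01' AND '2018-09-08'
GROUP BY dpid, rtlid;
```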
Upvotes: 0