Your Common Sense

Reputation: 158009

Optimize a query that groups results by a field from the joined table

I've got a very simple query that has to group the results by a field from the joined table:

SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'
GROUP BY p.name

Table ycs_products is actually sales_products; it lists the products in each sale. I want to see the share of each product sold over a period of time.

The current query takes 2 seconds, which is too slow for user interaction. I need to make this query run fast. Is there a way to get rid of Using temporary without denormalization?

The join order is critically important; there is a lot of data in both tables, and limiting the number of records by date is an unquestionable prerequisite.

Here is the EXPLAIN result:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: range
possible_keys: PRIMARY,dtm
          key: dtm
      key_len: 6
          ref: NULL
         rows: 1164728
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: p
         type: ref
possible_keys: sales_id
          key: sales_id
      key_len: 5
          ref: test.s.id
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

and the same in JSON:

EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "filesort": {
      "sort_key": "p.`name`",
      "temporary_table": {
        "table": {
          "table_name": "s",
          "access_type": "range",
          "possible_keys": ["PRIMARY", "dtm"],
          "key": "dtm",
          "key_length": "6",
          "used_key_parts": ["dtm"],
          "rows": 1164728,
          "filtered": 100,
          "attached_condition": "s.dtm between '2018-02-16 00:00:00' and '2018-02-22 23:59:59'",
          "using_index": true
        },
        "table": {
          "table_name": "p",
          "access_type": "ref",
          "possible_keys": ["sales_id"],
          "key": "sales_id",
          "key_length": "5",
          "used_key_parts": ["sales_id"],
          "ref": ["test.s.id"],
          "rows": 1,
          "filtered": 100
        }
      }
    }
  }
}

as well as the CREATE TABLE statements, though I find them unnecessary:

    CREATE TABLE `ycs_sales` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `dtm` datetime DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `dtm` (`dtm`)
    ) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
    CREATE TABLE `ycs_products` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `sales_id` int(11) DEFAULT NULL,
      `name` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `sales_id` (`sales_id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1

And also PHP code to replicate the test environment:

# assumes a PDO connection to the `test` schema; DSN and credentials below are placeholders
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');
#$pdo->query("set global innodb_flush_log_at_trx_commit = 2");
$pdo->query("create table ycs_sales (id int auto_increment primary key, dtm datetime)");
$stmt = $pdo->prepare("insert into ycs_sales values (null, ?)");
foreach (range(mktime(0,0,0,2,1,2018), mktime(0,0,0,2,28,2018)) as $stamp){
    $stmt->execute([date("Y-m-d", $stamp)]);
}
$max_id = $pdo->lastInsertId();
$pdo->query("alter table ycs_sales add key(dtm)");

$pdo->query("create table ycs_products (id int auto_increment primary key, sales_id int, name varchar(255))");
$stmt = $pdo->prepare("insert into ycs_products values (null, ?, ?)");
$products = ['food', 'drink', 'vape'];
foreach (range(1, $max_id) as $id){
    $stmt->execute([$id, $products[rand(0,2)]]);
}
$pdo->query("alter table ycs_products add key(sales_id)");

Upvotes: 1

Views: 656

Answers (7)

LSerni

Reputation: 57453

I have had similar problems several times. Usually, I'd expect the best results to be obtained with

-- Add covering indexes
CREATE INDEX s_date ON ycs_sales(dtm, id);
CREATE INDEX p_name ON ycs_products(sales_id, name);

This ought to get rid of the "the tables are very large" problem, since all the information required is now contained in the two indexes. Actually, I seem to remember that the first index does not need id, since id is the primary key and InnoDB secondary indexes already include the primary key columns implicitly.

If this is still not enough because the two tables are too large, then you have no choice: you must avoid the JOIN. It is already going as fast as it can, and if that's not enough, then it has to go.

I believe you can do this with a couple of TRIGGERs to maintain an ancillary daily sales report table (if you never have returned products, then just the one trigger on INSERT in sales will suffice). Try to go with just (product_id, sales_date, sales_count) and JOIN that with the product table to get the name on output; if that is not enough, then use (product_id, product_name, sales_date, sales_count) and periodically update product_name to keep the names synced by reading them off the primary table. Since (sales_date, product_id) is now unique and you run searches on sales_date, you can declare that pair the primary key and partition the ancillary table by sales year.
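
A minimal sketch of such a trigger and its summary table, adapted to this question's schema (the name is denormalized, so the table is keyed on day and product name). The table and trigger names are made up, and it assumes product rows are inserted after their ycs_sales row so the sale's dtm can be looked up:

CREATE TABLE daily_sales (
  product_name varchar(255) NOT NULL,
  sales_date   date NOT NULL,
  sales_count  int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (sales_date, product_name)
) ENGINE=InnoDB;

CREATE TRIGGER ycs_products_count_ai AFTER INSERT ON ycs_products
FOR EACH ROW
  -- look up the sale's day and bump the counter for this product
  INSERT INTO daily_sales (product_name, sales_date, sales_count)
  SELECT NEW.name, DATE(s.dtm), 1
    FROM ycs_sales s
   WHERE s.id = NEW.sales_id
  ON DUPLICATE KEY UPDATE sales_count = sales_count + 1;

The report query then reads only the much smaller summary table:

SELECT product_name, SUM(sales_count)
FROM daily_sales
WHERE sales_date BETWEEN '2018-02-16' AND '2018-02-22'
GROUP BY product_name;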

(Once or twice, when partitioning was not possible but I was confident that I would only very rarely cross the "ideal" partition boundary, I partitioned manually - i.e. sales_2012, sales_2013, sales_2014 - and built programmatically a UNION of the two or three years involved, followed by a regroup, resort and secondary totalization stage. Crazy as a March hare, yes, but it worked).

Upvotes: 1

Paul Spiegel

Reputation: 31832

I've run some test queries on the same data set. Here are my results:

Your query executes in 1.4 seconds. After adding the covering index on ycs_products(sales_id, name) with

ALTER TABLE `ycs_products`
  DROP INDEX `sales_id`,
  ADD INDEX `sales_id_name` (`sales_id`, `name`)

the execution time drops to 1.0 second. I still see "Using temporary; Using filesort" in the EXPLAIN result, but now there is also "Using index", which means no lookup into the clustered index is needed to get the values of the name column.

Note: I dropped the old index, since it will be redundant for most queries. But you might have some queries which need that index with id (PK) coming right after sales_id.

You explicitly asked how to get rid of "Using temporary". But even if you find a way to force an execution plan which avoids the filesort, you wouldn't win much. Consider the following query:

SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'

This one needs 0.855 seconds. Since there is no GROUP BY clause, no filesort is performed. It doesn't return the result that you want, but the point is: this is the bottom limit of what you can get without storing and maintaining redundant data.

If you want to know where the engine spends most of the time, remove the JOIN:

SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'

It executes in 0.155 seconds. So we can conclude: The JOIN is the most expensive part of the query. And you cannot avoid it.

The full list of execution times:

  • 0.155 sec (11%) to read and count 604K rows
  • 0.690 sec (49%) for the JOIN (which you cannot avoid)
  • 0.385 sec (28%) for second lookup (which can be removed with an index)
  • 0.170 sec (12%) for GROUP BY with filesort (which you try to avoid)

So again: "Using temporary; Using filesort" looks bad in the EXPLAIN result, but it's not your biggest problem.

Test environment:

Windows 10 + MariaDB 10.3.13 with innodb_buffer_pool_size = 1G

Test data has been generated with the following script (it takes about 1 to 2 minutes on an HDD):

drop table if exists ids;
create table ids(id mediumint unsigned auto_increment primary key);
insert into ids(id)
  select null as id
  from information_schema.COLUMNS c1
     , information_schema.COLUMNS c2
     , information_schema.COLUMNS c3
  limit 2332801; -- 60*60*24*27 + 1
drop table if exists ycs_sales;
CREATE TABLE `ycs_sales` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dtm` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `dtm` (`dtm`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_sales(id, dtm) select id, date('2018-02-01' + interval (id-1) second) from ids;
drop table if exists ycs_products;
CREATE TABLE `ycs_products` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `sales_id` int(11) DEFAULT NULL,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_products(id, sales_id, name)
    select id
    , id as sales_id
    , case floor(rand(1)*3)
      when 0 then 'food'
      when 1 then 'drink'
      when 2 then 'vape'
    end as name
    from ids;

Upvotes: 1

Mehmet Kaplan

Reputation: 2382

Referring to your comment below, I assume filtering by the column s.dtm is inevitable:

The join order is critically important, there is a lot of data in both tables and limiting the number of records by date is unquestionable prerequisite.

The most crucial action you can take is to observe the frequent search patterns.

For example, if your search criteria for dtm usually retrieve whole days' data, i.e. a few days' worth (say fewer than 15), each spanning 00:00:00 to 23:59:59, you can use this information to shift the overhead from search time to insert time.

A method to do so: add a new column to your table which holds the truncated day value, and hash-index that new column. (In MySQL there is no such concept as a functional index as there is in Oracle; that is why we need to add a new column to imitate that functionality.) Something like:

alter table ycs_sales add dtm_truncated date;

delimiter //
create trigger dtm_truncater_insert
    before insert on ycs_sales 
    for each row 
        set new.dtm_truncated = date(new.dtm);
//
delimiter //
create trigger dtm_truncater_update
    before update on ycs_sales 
    for each row 
        set new.dtm_truncated = date(new.dtm);
//

create index index_ycs_sales_dtm_truncated on ycs_sales(dtm_truncated) using hash;

# backfill the new column for existing rows, bypassing safe update mode with id > -1
update ycs_sales set dtm_truncated = date(dtm) where id > -1;

Then you can query using the dtm_truncated field with the IN operator. Of course this has its own tradeoffs; longer ranges will not work well. But, as I mentioned above, what you can do is use the new column as a function output that indexes likely searches at insert/update time.

SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id 
WHERE s.dtm_truncated in ( '2018-02-16',  '2018-02-17',  '2018-02-18',  '2018-02-19',  '2018-02-20',  '2018-02-21',  '2018-02-22')
GROUP BY p.name

Additionally, make sure your key on dtm is a BTREE key. (If it is a hash key, InnoDB needs to go through all keys for a range condition.) The syntax to generate a BTREE index is:

create index index_ycs_sales_dtm on ycs_sales(dtm) using btree;

One final note:

Actually "partitioning pruning" (ref: here) is a concept to partition your data at insert time. But in MySql, I don't know why, partitioning requires related column to be in the primary key. I believe you don't want to add dtmcolumn into the primary key. But if you can do so, then you can also partition your data and get rid of the date range check overhead at the select time.

Upvotes: 2

Filippo Possenti

Reputation: 1410

Not really providing an answer here, but I believe the core of the issue is nailing down where the real slowdown happens. I'm not a MySQL expert, but I would try running the following queries:

SELECT SQL_NO_CACHE name, COUNT(*) FROM (
    SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
    WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59') t
GROUP BY name;

SELECT SQL_NO_CACHE COUNT(*) FROM (
    SELECT name, COUNT(*) FROM (
        SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59') t1
    GROUP BY name
) t2;

SELECT SQL_NO_CACHE s.* FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59';

SELECT SQL_NO_CACHE COUNT(*) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59';

When you do, can you tell us how long each one took?

Upvotes: 1

Rick James

Reputation: 142528

Summary table.

Build and maintain a table that summarizes all sales on a daily basis. It would have the name (denormalized) and date. Hence the table should be smaller than the original data.

The summary table would be something like

CREATE TABLE sales_summary (
    dy DATE NOT NULL,
    name varchar(255) NOT NULL,
    daily_count SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY(dy, name),
    INDEX(name, dy)   -- (You might need this for other queries)
) ENGINE=InnoDB;

The nightly (after midnight) update would be a single query something like the following. It may well take more than 2 seconds, but no user is waiting for it.

INSERT INTO sales_summary (dy, name, daily_count)
    SELECT DATE(s.dtm) AS dy,
           p.name,
           COUNT(*) AS one_day_count
        FROM ycs_sales s
        JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
          AND s.dtm  < CURDATE()
        GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE
        daily_count = daily_count + VALUES(one_day_count);
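
One way to schedule that nightly query from inside MySQL is an event. This is only a sketch: it assumes the event scheduler is enabled and that a run shortly after midnight is acceptable; the event name is made up. A cron job running the same INSERT through the mysql client works just as well.

SET GLOBAL event_scheduler = ON;  -- needs to be enabled once for any event to run

CREATE EVENT ev_sales_summary_nightly
  ON SCHEDULE EVERY 1 DAY
  STARTS CURRENT_DATE + INTERVAL 25 HOUR  -- first run at 01:00 tomorrow
  DO
    INSERT INTO sales_summary (dy, name, daily_count)
        SELECT DATE(s.dtm), p.name, COUNT(*)
            FROM ycs_sales s
            JOIN ycs_products p ON s.id = p.sales_id
            WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
              AND s.dtm  < CURDATE()
            GROUP BY 1, 2
        ON DUPLICATE KEY UPDATE daily_count = daily_count + VALUES(daily_count);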

And the user's query will be something like:

SELECT SQL_NO_CACHE 
        name,
        SUM(daily_count)
    FROM sales_summary
    WHERE dy >= '2018-02-16'
      AND dy  < '2018-02-16' + INTERVAL 7 DAY
    GROUP BY name;

More discussion of Summary Tables: http://mysql.rjweb.org/doc.php/summarytables

Upvotes: 2

Rick James

Reputation: 142528

Why have an id for ycs_products? It seems like the sales_id should be the PRIMARY KEY of that table?

If that is possible, it eliminates the performance problem by getting rid of the issues brought up by senape.
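
A sketch of that change, assuming every sale really has exactly one ycs_products row; the check and the ALTER below are illustrative only, so verify them on a copy of the data first:

-- returns nothing only if sales_id is already unique
SELECT sales_id FROM ycs_products GROUP BY sales_id HAVING COUNT(*) > 1 LIMIT 1;

-- if it is unique (and never NULL), sales_id can replace id as the primary key
ALTER TABLE ycs_products
  MODIFY sales_id INT NOT NULL,
  DROP PRIMARY KEY,
  DROP COLUMN id,
  DROP INDEX sales_id,
  ADD PRIMARY KEY (sales_id);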

If, instead, there are multiple rows for each sales_id, then changing the secondary index to this would help:

INDEX(sales_id, name)

Another thing to check is innodb_buffer_pool_size. It should be about 70% of available RAM. This would improve the cacheability of data and indexes.
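
To check the current value and, on MySQL 5.7+ / MariaDB 10.2+, resize it without a restart (the 4G figure is only an example and assumes the machine has the RAM to spare):

-- current value, in bytes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- example resize; on older versions set it in my.cnf under [mysqld] and restart instead
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;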

Are there really 1.1 million rows in that one week?

Upvotes: 2

senape

Reputation: 342

The problem is that grouping by name makes you lose the sales_id information, so MySQL is forced to use a temporary table.

Although it's not the cleanest of solutions, and one of my least favorite approaches, you could add a new index on both the name and sales_id columns, like:

ALTER TABLE `yourdb`.`ycs_products` 
ADD INDEX `name_sales_id_idx` (`name` ASC, `sales_id` ASC);

and force the query to use this index, with either FORCE INDEX or USE INDEX:

SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p use index(name_sales_id_idx) ON s.id = p.sales_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'
GROUP BY p.name;

My execution reported only "using where; using index" on the table p and "using where" on the table s.

Anyway, I strongly suggest you rethink your schema, because you might well find a better design for these two tables. On the other hand, if this is not a critical part of your application, you can live with the "forced" index.

EDIT

Since it's quite clear that the problem is in the design, I suggest modelling the relationship as a many-to-many. If you have the chance to verify it in your testing environment, here's what I would do:

1) Create a temporary table just to store name and id of the product:

create temporary table tmp_prods
select min(id) id, name
from ycs_products
group by name;

2) Starting from the temporary table, create the replacement table for ycs_products:

create table ycs_products_new
select * from tmp_prods;

ALTER TABLE `poc`.`ycs_products_new` 
CHANGE COLUMN `id` `id` INT(11) NOT NULL ,
ADD PRIMARY KEY (`id`);

3) Create the join table:

CREATE TABLE `prod_sale` (
`prod_id` INT(11) NOT NULL,
`sale_id` INT(11) NOT NULL,
PRIMARY KEY (`prod_id`, `sale_id`),
INDEX `sale_fk_idx` (`sale_id` ASC),
CONSTRAINT `prod_fk`
  FOREIGN KEY (`prod_id`)
  REFERENCES ycs_products_new (`id`)
  ON DELETE NO ACTION
  ON UPDATE NO ACTION,
CONSTRAINT `sale_fk`
  FOREIGN KEY (`sale_id`)
  REFERENCES ycs_sales (`id`)
  ON DELETE NO ACTION
  ON UPDATE NO ACTION);

and fill it with the existing values:

insert into prod_sale (prod_id, sale_id)
select tmp_prods.id, sales_id from ycs_sales s
inner join ycs_products p
on p.sales_id=s.id
inner join tmp_prods on tmp_prods.name=p.name;

Finally, the join query:

select name, count(name) from ycs_products_new p
inner join prod_sale ps on ps.prod_id=p.id
inner join ycs_sales s on s.id=ps.sale_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'
group by p.id;

Please note that the GROUP BY is on the primary key, not on the name.

Explain output:

explain select name, count(name) from ycs_products_new p inner join prod_sale ps on ps.prod_id=p.id inner join ycs_sales s on s.id=ps.sale_id  WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59' group by p.id;
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| id   | select_type | table | type   | possible_keys       | key     | key_len | ref             | rows | Extra       |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
|    1 | SIMPLE      | p     | index  | PRIMARY             | PRIMARY | 4       | NULL            |    3 |             |
|    1 | SIMPLE      | ps    | ref    | PRIMARY,sale_fk_idx | PRIMARY | 4       | test.p.id       |    1 | Using index |
|    1 | SIMPLE      | s     | eq_ref | PRIMARY,dtm         | PRIMARY | 4       | test.ps.sale_id |    1 | Using where |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+

Upvotes: 5
