Reputation: 158009
I've got a very simple query that has to group the results by a field from the joined table:
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
GROUP BY p.name
Table ycs_products is actually sales_products; it lists the products in each sale. I want to see the share of each product sold over a period of time.
The current query takes 2 seconds, which is too slow for user interaction. I need to make this query run fast. Is there a way to get rid of Using temporary without denormalization?
The join order is critically important: there is a lot of data in both tables, and limiting the number of records by date is an unquestionable prerequisite.
Here is the EXPLAIN result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: s
type: range
possible_keys: PRIMARY,dtm
key: dtm
key_len: 6
ref: NULL
rows: 1164728
Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: sales_id
key: sales_id
key_len: 5
ref: test.s.id
rows: 1
Extra:
2 rows in set (0.00 sec)
and the same in JSON:
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "filesort": {
      "sort_key": "p.`name`",
      "temporary_table": {
        "table": {
          "table_name": "s",
          "access_type": "range",
          "possible_keys": ["PRIMARY", "dtm"],
          "key": "dtm",
          "key_length": "6",
          "used_key_parts": ["dtm"],
          "rows": 1164728,
          "filtered": 100,
          "attached_condition": "s.dtm between '2018-02-16 00:00:00' and '2018-02-22 23:59:59'",
          "using_index": true
        },
        "table": {
          "table_name": "p",
          "access_type": "ref",
          "possible_keys": ["sales_id"],
          "key": "sales_id",
          "key_length": "5",
          "used_key_parts": ["sales_id"],
          "ref": ["test.s.id"],
          "rows": 1,
          "filtered": 100
        }
      }
    }
  }
}
as well as the CREATE TABLE statements, though I find it unnecessary:
CREATE TABLE `ycs_sales` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtm` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dtm` (`dtm`)
) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
CREATE TABLE `ycs_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sales_id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
And also the PHP code to replicate the test environment:
#$pdo->query("set global innodb_flush_log_at_trx_commit = 2");
$pdo->query("create table ycs_sales (id int auto_increment primary key, dtm datetime)");
$stmt = $pdo->prepare("insert into ycs_sales values (null, ?)");
foreach (range(mktime(0,0,0,2,1,2018), mktime(0,0,0,2,28,2018)) as $stamp){
    $stmt->execute([date("Y-m-d", $stamp)]);
}
$max_id = $pdo->lastInsertId();
$pdo->query("alter table ycs_sales add key(dtm)");
$pdo->query("create table ycs_products (id int auto_increment primary key, sales_id int, name varchar(255))");
$stmt = $pdo->prepare("insert into ycs_products values (null, ?, ?)");
$products = ['food', 'drink', 'vape'];
foreach (range(1, $max_id) as $id){
    $stmt->execute([$id, $products[rand(0,2)]]);
}
$pdo->query("alter table ycs_products add key(sales_id)");
Upvotes: 1
Views: 656
Reputation: 57453
I have had similar problems several times. Usually, I'd expect the best results to be obtained with
CREATE INDEX s_date ON ycs_sales(dtm, id);
-- Add a covering index
CREATE INDEX p_name ON ycs_products(sales_id, name);
This ought to get rid of the "the tables are very large" problem, since all the information required is now contained in the two indexes. Actually, I seem to remember that the first index does not need id if the latter is the primary key.
If this is still not enough, because the two tables are too large, then you have no choice - you must avoid the JOIN. It is already going as fast as it can and if that's not enough, then it has to go.
I believe you can do this with a couple of TRIGGERs to maintain an ancillary daily sales report table (if you never have returned products, then just the one trigger on INSERT in sales will suffice). Try to go with just (product_id, sales_date, sales_count) and JOIN that with the product table to get the name upon output; but if that is not enough, then use (product_id, product_name, sales_date, sales_count) and periodically update product_name to keep the names synced by reading them off the primary table. Since each (product_id, sales_date) pair is now unique and you run searches on the date, you can declare that pair the primary key and partition the ancillary table by sales year.
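A minimal sketch of such a trigger, adapted to the schema from the question (the daily_sales table and trigger name are made up here; the answer's (product_id, ...) layout assumes a separate product catalogue, which the posted schema doesn't have, so this version keys on the product name instead):
CREATE TABLE daily_sales (
    name        VARCHAR(255) NOT NULL,
    sales_date  DATE         NOT NULL,
    sales_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (name, sales_date)
) ENGINE=InnoDB;

DELIMITER //
-- Fires once per product row; looks up the sale's date and bumps the daily counter.
CREATE TRIGGER trg_daily_sales_ins
AFTER INSERT ON ycs_products
FOR EACH ROW
BEGIN
    INSERT INTO daily_sales (name, sales_date, sales_count)
    SELECT NEW.name, DATE(s.dtm), 1
      FROM ycs_sales s
     WHERE s.id = NEW.sales_id
    ON DUPLICATE KEY UPDATE sales_count = sales_count + 1;
END//
DELIMITER ;
The user-facing report then becomes a plain range scan over daily_sales grouped by name, with no JOIN at query time.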
(Once or twice, when partitioning was not possible but I was confident that I would only very rarely cross the "ideal" partition boundary, I partitioned manually - i.e. sales_2012, sales_2013, sales_2014 - and programmatically built a UNION of the two or three years involved, followed by a regroup, resort and secondary totalization stage. Crazy as a March hare, yes, but it worked.)
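As a rough illustration of that manual approach (the per-year tables sales_2013 and sales_2014 are hypothetical, assumed to have the same structure as ycs_sales), the application would stitch the years together and re-aggregate:
SELECT name, SUM(cnt) AS total
FROM (
    SELECT p.name, COUNT(*) AS cnt
      FROM sales_2013 s
      JOIN ycs_products p ON p.sales_id = s.id
     WHERE s.dtm >= '2013-12-28 00:00:00'
     GROUP BY p.name
    UNION ALL
    SELECT p.name, COUNT(*) AS cnt
      FROM sales_2014 s
      JOIN ycs_products p ON p.sales_id = s.id
     WHERE s.dtm < '2014-01-04 00:00:00'
     GROUP BY p.name
) per_year
GROUP BY name
ORDER BY name;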
Upvotes: 1
Reputation: 31832
I've run some test queries on the same data set. Here are my results:
Your query executes in 1.4 seconds.
After adding the covering index on ycs_products(sales_id, name) with
ALTER TABLE `ycs_products`
DROP INDEX `sales_id`,
ADD INDEX `sales_id_name` (`sales_id`, `name`)
the execution time drops to 1.0 second.
I still see "Using temporary; Using filesort" in the EXPLAIN result.
But now there is also "Using index" - which means no lookup into the clustered index is needed to get the values of the name column.
Note: I dropped the old index, since it will be redundant for most queries. But you might have some queries which need that index, with id (PK) coming right after sales_id.
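For example, a hypothetical query like the one below (not from the question) reads better from the old (sales_id) index, because there the PK follows sales_id directly in index order and the ORDER BY needs no extra sort:
-- With (sales_id, name) the secondary index is ordered by (sales_id, name, id),
-- so this ORDER BY would require a filesort; with (sales_id) it is an ordered index scan.
SELECT id
  FROM ycs_products
 WHERE sales_id BETWEEN 1000 AND 2000
 ORDER BY sales_id, id;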
You explicitly asked how to get rid of "Using temporary". But even if you find a way to force an execution plan which avoids the filesort, you wouldn't win much. Consider the following query:
SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
This one needs 0.855 seconds. Since there is no GROUP BY clause, no filesort is performed. It doesn't return the result that you want - but the point is: this is the bottom limit of what you can get without storing and maintaining redundant data.
If you want to know where the engine spends most of the time - remove the JOIN:
SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
It executes in 0.155 seconds. So we can conclude: The JOIN is the most expensive part of the query. And you cannot avoid it.
The full list of execution times:
- original query (GROUP BY p.name): 1.4 seconds
- with the covering index on (sales_id, name): 1.0 second
- COUNT(1) with the JOIN but without GROUP BY: 0.855 seconds
- COUNT(1) on ycs_sales alone, without the JOIN: 0.155 seconds
So again: "Using temporary; Using filesort" looks bad in the EXPLAIN result - but it's not your biggest problem.
Test environment: Windows 10 + MariaDB 10.3.13 with innodb_buffer_pool_size = 1G.
The test data has been generated with the following script (takes about 1 to 2 minutes on an HDD):
drop table if exists ids;
create table ids(id mediumint unsigned auto_increment primary key);
insert into ids(id)
select null as id
from information_schema.COLUMNS c1
, information_schema.COLUMNS c2
, information_schema.COLUMNS c3
limit 2332801; -- 60*60*24*27 + 1
drop table if exists ycs_sales;
CREATE TABLE `ycs_sales` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtm` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dtm` (`dtm`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_sales(id, dtm) select id, date('2018-02-01' + interval (id-1) second) from ids;
drop table if exists ycs_products;
CREATE TABLE `ycs_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sales_id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_products(id, sales_id, name)
select id
, id as sales_id
, case floor(rand(1)*3)
when 0 then 'food'
when 1 then 'drink'
when 2 then 'vape'
end as name
from ids;
Upvotes: 1
Reputation: 2382
Referring to your comment below, I assume that filtering by the column s.dtm is inevitable.
The join order is critically important, there is a lot of data in both tables and limiting the number of records by date is unquestionable prerequisite.
The most crucial action you can take is to observe the frequent search patterns.
For example, if your search criteria for dtm usually retrieve whole days' data - i.e. a few days (say fewer than 15), each between 00:00:00 and 23:59:59 - you can use this information to offload the overhead from search time to insert time.
A method to do so: add a new column to your table which holds the truncated day value, and hash-index that new column. (MySQL has no concept of a functional index as Oracle does; that is why we need to add a new column to imitate that functionality.) Something like:
alter table ycs_sales add dtm_truncated date;
delimiter //
create trigger dtm_truncater_insert
before insert on ycs_sales
for each row
set new.dtm_truncated = date(new.dtm);
//
create trigger dtm_truncater_update
before update on ycs_sales
for each row
set new.dtm_truncated = date(new.dtm);
//
delimiter ;
create index index_ycs_sales_dtm_truncated on ycs_sales(dtm_truncated) using hash;
# populate the new column for existing rows; "id > -1" bypasses the safe update mode
update ycs_sales set dtm_truncated = date(dtm) where id > -1;
Then you can query using the dtm_truncated field with the IN operator. Of course this has its own trade-offs; longer ranges will not work. But, as I mentioned above, what you can do is use the new column as a function output that indexes possible searches at insert/update time.
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm_truncated in ( '2018-02-16', '2018-02-17', '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21', '2018-02-22')
GROUP BY p.name
Additionally, ensure that your key on dtm is a BTREE key. (If it is a hash key, InnoDB needs to go through all the keys.) The syntax to create a BTREE index is:
create index index_ycs_sales_dtm on ycs_sales(dtm) using btree;
One final note:
"Partition pruning" (ref: here) is actually a concept for partitioning your data at insert time. But in MySQL - I don't know why - partitioning requires the related column to be part of the primary key. I believe you don't want to add the dtm column to the primary key. But if you can do so, then you can also partition your data and get rid of the date-range check overhead at select time.
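A rough sketch of what that could look like, assuming you can afford to extend the primary key to (id, dtm) (the partition names and boundaries below are purely illustrative):
-- Every unique key must include the partitioning column, hence the PK change.
ALTER TABLE ycs_sales
  MODIFY dtm DATETIME NOT NULL,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, dtm);

ALTER TABLE ycs_sales
  PARTITION BY RANGE COLUMNS (dtm) (
    PARTITION p2018_01 VALUES LESS THAN ('2018-02-01'),
    PARTITION p2018_02 VALUES LESS THAN ('2018-03-01'),
    PARTITION pmax     VALUES LESS THAN (MAXVALUE)
  );
With that in place, a date-range WHERE clause only touches the relevant partition(s).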
Upvotes: 2
Reputation: 1410
Not really providing an answer here, but I believe the core of the issue is nailing down where the real slowdown happens. I'm not a MySQL expert, but I would try running the following queries:
SELECT SQL_NO_CACHE name, count(*) FROM (
    SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
    WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59') t
GROUP BY name
SELECT SQL_NO_CACHE COUNT(*) FROM (
    SELECT SQL_NO_CACHE name, count(*) FROM (
        SELECT SQL_NO_CACHE p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59') t1
    GROUP BY name
) t2
SELECT SQL_NO_CACHE s.* FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
SELECT SQL_NO_CACHE COUNT(*) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
When you do, can you tell us how long each one took?
Upvotes: 1
Reputation: 142528
Summary table.
Build and maintain a table that summarizes all sales on a daily basis. It would have the name (denormalized) and the date. Hence the table should be smaller than the original data.
The summary table would be something like
CREATE TABLE sales_summary (
dy DATE NOT NULL,
name varchar(255) NOT NULL,
daily_count SMALLINT UNSIGNED NOT NULL,
PRIMARY KEY(dy, name),
INDEX(name, dy) -- (You might need this for other queries)
) ENGINE=InnoDB;
The nightly (after midnight) update would be a single query something like the following. It may well take more than 2 seconds, but no user is waiting for it.
INSERT INTO sales_summary (dy, name, daily_count)
    SELECT DATE(s.dtm) AS dy,
           p.name,
           COUNT(*) AS one_day_count
      FROM ycs_sales s
      JOIN ycs_products p ON s.id = p.sales_id
     WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
       AND s.dtm < CURDATE()
     GROUP BY 1, 2
ON DUPLICATE KEY UPDATE
    daily_count = daily_count + VALUES(daily_count);
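One way to schedule that nightly refresh inside the database (my own suggestion, not part of the answer above) is an EVENT; it requires the event scheduler to be enabled. A cron job running the same INSERT through the mysql client would work just as well.
SET GLOBAL event_scheduler = ON;

CREATE EVENT ev_sales_summary_nightly
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 10 MINUTE  -- shortly after midnight
DO
    INSERT INTO sales_summary (dy, name, daily_count)
        SELECT DATE(s.dtm), p.name, COUNT(*)
          FROM ycs_sales s
          JOIN ycs_products p ON s.id = p.sales_id
         WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
           AND s.dtm <  CURDATE()
         GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE daily_count = daily_count + VALUES(daily_count);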
And the user's query will be something like:
SELECT SQL_NO_CACHE
name,
SUM(daily_count)
FROM sales_summary
WHERE dy >= '2018-02-16'
AND dy < '2018-02-16' + INTERVAL 7 DAY
GROUP BY name;
More discussion of Summary Tables: http://mysql.rjweb.org/doc.php/summarytables
Upvotes: 2
Reputation: 142528
Why have an id for ycs_products? It seems like the sales_id should be the PRIMARY KEY of that table?
If that is possible, it eliminates the performance problem by getting rid of the issues brought up by senape.
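A sketch of that change, assuming every sale has exactly one product row and that sales_id contains no NULLs (neither is guaranteed by the posted schema):
-- Make sales_id the primary key and drop the surrogate id.
ALTER TABLE ycs_products MODIFY sales_id INT(11) NOT NULL;
ALTER TABLE ycs_products DROP COLUMN id;          -- also drops the old PK on id
ALTER TABLE ycs_products DROP INDEX sales_id,
                         ADD PRIMARY KEY (sales_id);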
If, instead, there are multiple rows for each sales_id, then changing the secondary index to this would help:
INDEX(sales_id, name)
Another thing to check is innodb_buffer_pool_size. It should be about 70% of available RAM. This would improve the cacheability of data and indexes.
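To check the current value and, for example, raise it on a hypothetical server with 16 GB of RAM dedicated to MySQL:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- in my.cnf / my.ini (newer MySQL/MariaDB versions can also resize it at
-- runtime with SET GLOBAL innodb_buffer_pool_size = ...):
-- [mysqld]
-- innodb_buffer_pool_size = 11G   -- roughly 70% of 16 GB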
Are there really 1.1 million rows in that one week?
Upvotes: 2
Reputation: 342
The problem is that grouping by name makes you lose the sales_id information, so MySQL is forced to use a temporary table.
Although it's not the cleanest of solutions, and one of my least favorite approaches, you could add a new index on both the name and sales_id columns, like:
ALTER TABLE `yourdb`.`ycs_products`
ADD INDEX `name_sales_id_idx` (`name` ASC, `sales_id` ASC);
and force the query to use this index, with either force index or use index:
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p use index(name_sales_id_idx) ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
GROUP BY p.name;
My execution reported only "using where; using index" on the table p and "using where" on the table s.
Anyway, I strongly suggest you rethink your schema, because you can probably find a better design for these two tables. On the other hand, if this is not a critical part of your application, you can deal with the "forced" index.
Since it's quite clear that the problem is in the design, I suggest modelling the relationship as a many-to-many. If you have a chance to verify it in your testing environment, here's what I would do:
1) Create a temporary table just to store name and id of the product:
create temporary table tmp_prods
select min(id) id, name
from ycs_products
group by name;
2) Starting from the temporary table, create a replacement for the ycs_products table:
create table ycs_products_new
select * from tmp_prods;
ALTER TABLE `poc`.`ycs_products_new`
CHANGE COLUMN `id` `id` INT(11) NOT NULL ,
ADD PRIMARY KEY (`id`);
3) Create the join table:
CREATE TABLE `prod_sale` (
`prod_id` INT(11) NOT NULL,
`sale_id` INT(11) NOT NULL,
PRIMARY KEY (`prod_id`, `sale_id`),
INDEX `sale_fk_idx` (`sale_id` ASC),
CONSTRAINT `prod_fk`
FOREIGN KEY (`prod_id`)
REFERENCES ycs_products_new (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `sale_fk`
FOREIGN KEY (`sale_id`)
REFERENCES ycs_sales (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION);
and fill it with the existing values:
insert into prod_sale (prod_id, sale_id)
select tmp_prods.id, sales_id from ycs_sales s
inner join ycs_products p
on p.sales_id=s.id
inner join tmp_prods on tmp_prods.name=p.name;
Finally, the join query:
select name, count(name) from ycs_products_new p
inner join prod_sale ps on ps.prod_id=p.id
inner join ycs_sales s on s.id=ps.sale_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
group by p.id;
Please note that the GROUP BY is on the primary key, not on the name.
Explain output:
explain select name, count(name) from ycs_products_new p inner join prod_sale ps on ps.prod_id=p.id inner join ycs_sales s on s.id=ps.sale_id WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59' group by p.id;
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| 1 | SIMPLE | p | index | PRIMARY | PRIMARY | 4 | NULL | 3 | |
| 1 | SIMPLE | ps | ref | PRIMARY,sale_fk_idx | PRIMARY | 4 | test.p.id | 1 | Using index |
| 1 | SIMPLE | s | eq_ref | PRIMARY,dtm | PRIMARY | 4 | test.ps.sale_id | 1 | Using where |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
Upvotes: 5