Reputation: 505
I need to get users count, grouped by user type (A,B,C) and every month (that exist in db) in current year - only for users who don't have paid orders (with total > 0) in every month (every row returned by SQL), but have orders (with total > 0) in any previous months (in any year, not just current). In other words this is inactive users, who placed some paid order before, but don't placed any new orders in current SQL request row month returned.
What I expect to get in results (values are just examples):
label user_type data
Nov B 2
Nov A 1
Nov C 3
Dec C 1
.... other months
This means that in December there are 5 users with user type A and 3 users with user type B and 0 users with user type C, who DON'T placed orders in December 2021, but placed orders sometime before December in any year.
Sample DB (two tables - users and orders) with SQL that show number users, by every user type, in every month, who placed orders in this month. Instead of just this simple results, I need to get users counts that DON'T placed orders in this month, but placed paid orders somewhere before.
https://dbfiddle.uk/?rdbms=mysql_5.6&fiddle=4c4fadf67bcdc7cc3443f46c387173df
I need SQL that will work with MySQL 5.7
Upvotes: 3
Views: 615
Reputation: 12973
This uses a similar approach to VeteranSlayer but it starts with the cross join between months and users followed by the left join to orders. It also uses ranges for the date comparisons instead of the functions. It may perform really badly but it should give the correct result -
SELECT
months.month AS `label`,
u.user_type,
COUNT(u.userid) AS `data`
FROM (
SELECT 'Jan' `month`, '2021-01-01' month_start, '2021-01-31' month_end UNION ALL
SELECT 'Feb', '2021-02-01', '2021-02-28' UNION ALL
SELECT 'Mar', '2021-03-01', '2021-03-31' UNION ALL
SELECT 'Apr', '2021-04-01', '2021-04-30' UNION ALL
SELECT 'May', '2021-05-01', '2021-05-31' UNION ALL
SELECT 'Jun', '2021-06-01', '2021-06-30' UNION ALL
SELECT 'Jul', '2021-07-01', '2021-07-31' UNION ALL
SELECT 'Aug', '2021-08-01', '2021-08-31' UNION ALL
SELECT 'Sep', '2021-09-01', '2021-09-30' UNION ALL
SELECT 'Oct', '2021-10-01', '2021-10-31' UNION ALL
SELECT 'Nov', '2021-11-01', '2021-11-30' UNION ALL
SELECT 'Dec', '2021-12-01', '2021-12-31'
) months
INNER JOIN users u
LEFT JOIN orders o
ON o.date BETWEEN months.month_start AND months.month_end
AND o.user_id = u.userid
WHERE o.user_id IS NULL
AND EXISTS (
SELECT DISTINCT o1.user_id
FROM orders o1
WHERE o1.date < months.month_start
AND o1.user_id = u.userid
)
GROUP BY months.month, u.user_type
ORDER BY months.month_start ASC, u.user_type ASC;
EDIT
The performance of these queries varies dramatically based on the scale of the dataset, the distribution of data and the indices. I have done some tests with many different index variations and the following test datasets. Note the random data created in the two tables can lead to wildly different performance. The dummy
table referenced in the SELECTs of the INSERTs is just a random table with 1M rows.
CREATE TABLE `users` (
`id` int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
`user_type` char(1) NOT NULL,
KEY `IDX_user_type` (`user_type`)
);
INSERT INTO users (user_type)
SELECT
CASE (FLOOR(RAND() * 3) + 1) WHEN 1 THEN 'A' WHEN 2 THEN 'B' ELSE 'C' END AS `user_type`
FROM dummy
LIMIT 1000;
CREATE TABLE orders (
`id` int UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`user_id` int,
`date` DATE,
`total` DECIMAL(6,2),
KEY `IDX_user_id_date` (`user_id`,`date`)
);
INSERT INTO orders (user_id, date, total)
SELECT
(FLOOR(RAND() * 1000) + 1) AS `user_id`,
('2020-01-01' + INTERVAL FLOOR(RAND() * 685) + 1 DAY) AS `date`,
( (FLOOR(RAND() * 10) + 1) * 5) AS `total`
FROM dummy
LIMIT 100000;
The most significant performance difference across the queries came from adding -
KEY `IDX_user_id_date` (`user_id`,`date`)
and adding the user_type index gave a small but consistent improvement -
KEY `IDX_user_type` (`user_type`)
ProGu's query executed consistently with an average time of 1.466 sec. And my query was similarly consistent at 0.922 sec. Your mileage will vary!
I haven't included time's for VeteranSlayer's query as it returned radically different results.
EDIT 2
Repopulated the two tables with 50k users and 1M orders
TRUNCATE TABLE orders;
TRUNCATE TABLE users;
INSERT INTO users (user_type)
SELECT
CASE (FLOOR(RAND() * 3) + 1) WHEN 1 THEN 'A' WHEN 2 THEN 'B' ELSE 'C' END AS `user_type`
FROM (SELECT 1 FROM dummy LIMIT 50000) t;
INSERT INTO orders (user_id, date, total)
SELECT
(FLOOR(RAND() * 50000) + 1),
TIMESTAMPADD(SECOND, FLOOR(RAND() * TIMESTAMPDIFF(SECOND, '2016-01-01', '2021-12-13')), '2016-01-01'),
((FLOOR(RAND() * 50) + 1) * 5)
FROM (SELECT 1 FROM dummy LIMIT 1000000) t
ORDER BY date;
The resulting distribution of orders, by time and user_id, is quite even which is unlikely to be realistic so this test dataset grossly exacerbates any performance issues, I think.
I was surprised that by using my months table, ProGu's query was significantly faster, dropping from 21.062sec to 9.703sec, and using one less temporary table (two instead of three).
SELECT
months.month as label,
users.user_type,
SUM(
EXISTS (
SELECT 1
FROM orders
WHERE orders.user_id = users.id
AND orders.`date` < months.month_start
) AND NOT EXISTS (
SELECT 1
FROM orders
WHERE orders.user_id = users.id
AND orders.`date` BETWEEN months.month_start AND months.month_end
)
) counts
FROM (
SELECT 'Jan' `month`, '2021-01-01' month_start, '2021-01-31' month_end UNION ALL
SELECT 'Feb', '2021-02-01', '2021-02-28' UNION ALL
SELECT 'Mar', '2021-03-01', '2021-03-31' UNION ALL
SELECT 'Apr', '2021-04-01', '2021-04-30' UNION ALL
SELECT 'May', '2021-05-01', '2021-05-31' UNION ALL
SELECT 'Jun', '2021-06-01', '2021-06-30' UNION ALL
SELECT 'Jul', '2021-07-01', '2021-07-31' UNION ALL
SELECT 'Aug', '2021-08-01', '2021-08-31' UNION ALL
SELECT 'Sep', '2021-09-01', '2021-09-30' UNION ALL
SELECT 'Oct', '2021-10-01', '2021-10-31' UNION ALL
SELECT 'Nov', '2021-11-01', '2021-11-30' UNION ALL
SELECT 'Dec', '2021-12-01', '2021-12-31'
) months
CROSS JOIN users
GROUP BY months.month, users.user_type
ORDER BY months.month_start ASC, users.user_type ASC
My query above can be significantly improved by pre grouping the orders data for the current year (your mileage will vary but worth considering) -
SELECT
months.month AS `label`,
u.user_type,
COUNT(u.id) AS `data`
FROM (
SELECT 'Jan' `month`, '2021-01-01' month_start, '2021-01-31' month_end UNION ALL
SELECT 'Feb', '2021-02-01', '2021-02-28' UNION ALL
SELECT 'Mar', '2021-03-01', '2021-03-31' UNION ALL
SELECT 'Apr', '2021-04-01', '2021-04-30' UNION ALL
SELECT 'May', '2021-05-01', '2021-05-31' UNION ALL
SELECT 'Jun', '2021-06-01', '2021-06-30' UNION ALL
SELECT 'Jul', '2021-07-01', '2021-07-31' UNION ALL
SELECT 'Aug', '2021-08-01', '2021-08-31' UNION ALL
SELECT 'Sep', '2021-09-01', '2021-09-30' UNION ALL
SELECT 'Oct', '2021-10-01', '2021-10-31' UNION ALL
SELECT 'Nov', '2021-11-01', '2021-11-30' UNION ALL
SELECT 'Dec', '2021-12-01', '2021-12-31'
) months
INNER JOIN users u
LEFT JOIN (
SELECT `user_id`, DATE_FORMAT(`date`, '%Y-%m-01') AS `m`
FROM `orders`
WHERE `date` >= '2021-01-01'
GROUP BY `user_id`, `m`
) o
ON o.m = months.month_start
AND o.user_id = u.id
WHERE o.user_id IS NULL
AND EXISTS (
SELECT 1
FROM orders o1
WHERE o1.date < months.month_start
AND o1.user_id = u.id
)
GROUP BY months.month, u.user_type
ORDER BY months.month_start ASC, u.user_type ASC
Execution time dropped from 12.422sec to 6.497sec
And the final test I tried was de-normalising by adding first_order_date to the users table -
ALTER TABLE `users` ADD COLUMN `first_order_date` DATE NULL AFTER `user_type`;
UPDATE users u
INNER JOIN (SELECT o.user_id, MIN(date) AS `first_o`, MAX(date) AS `last_o` FROM orders o GROUP BY o.user_id) t ON u.id = t.user_id
SET `u`.`first_order_date` = `t`.`first_o`, `u`.`last_order_date` = `t`.`last_o`;
I then modified my query to use this instead of the EXISTS sub-query -
SELECT
`months`.`month` AS `label`,
`u`.`user_type`,
COUNT(`u`.`id`) AS `data`
FROM (
SELECT 'Jan' `month`, '2021-01-01' month_start, '2021-01-31' month_end UNION ALL
SELECT 'Feb', '2021-02-01', '2021-02-28' UNION ALL
SELECT 'Mar', '2021-03-01', '2021-03-31' UNION ALL
SELECT 'Apr', '2021-04-01', '2021-04-30' UNION ALL
SELECT 'May', '2021-05-01', '2021-05-31' UNION ALL
SELECT 'Jun', '2021-06-01', '2021-06-30' UNION ALL
SELECT 'Jul', '2021-07-01', '2021-07-31' UNION ALL
SELECT 'Aug', '2021-08-01', '2021-08-31' UNION ALL
SELECT 'Sep', '2021-09-01', '2021-09-30' UNION ALL
SELECT 'Oct', '2021-10-01', '2021-10-31' UNION ALL
SELECT 'Nov', '2021-11-01', '2021-11-30' UNION ALL
SELECT 'Dec', '2021-12-01', '2021-12-31'
) `months`
INNER JOIN `users` `u`
LEFT JOIN (
SELECT `user_id`, DATE_FORMAT(`date`, '%Y-%m-01') AS `m`
FROM `orders`
WHERE `date` >= '2021-01-01'
GROUP BY `user_id`, `m`
) o
ON `o`.`m` = `months`.`month_start`
AND `o`.`user_id` = `u`.`id`
WHERE `o`.`user_id` IS NULL
AND `u`.`first_order_date` < `months`.`month_start`
GROUP BY `months`.`month`, `u`.`user_type`
ORDER BY `months`.`month_start` ASC, `u`.`user_type` ASC;
This returns the same result in 1.447sec. Obviously, de-normalising like this should be avoided but I included it here as it shows the performance benefit for this one scenario.
Upvotes: 0
Reputation: 5410
Try this query to generate counts for all months x user type
SELECT
DATE_FORMAT(DATE(CONCAT_WS('-', YEAR(CURDATE()), months.mm, '01')), "%b") as label,
users.user_type,
SUM(
EXISTS (
SELECT 1
FROM orders
WHERE orders.user_id = users.userid
AND orders.`date` < DATE(CONCAT_WS('-', YEAR(CURDATE()), months.mm, '01'))
) AND NOT EXISTS (
SELECT 1
FROM orders
WHERE orders.user_id = users.userid
AND orders.`date` BETWEEN DATE(CONCAT_WS('-', YEAR(CURDATE()), months.mm, '01')) AND LAST_DAY(DATE(CONCAT_WS('-', YEAR(CURDATE()), months.mm, '01')))
)
) counts
FROM (
SELECT '01' mm
UNION SELECT '02' UNION SELECT '03' UNION SELECT '04' UNION SELECT '05'
UNION SELECT '06' UNION SELECT '07' UNION SELECT '08' UNION SELECT '09'
UNION SELECT '10' UNION SELECT '11' UNION SELECT '12'
) months
CROSS JOIN users
GROUP BY months.mm, users.user_type
Upvotes: 1
Reputation: 404
Test this query if it fits your needs
SELECT DATE_FORMAT(o.date, "%b") as label,
UPPER(u.user_type) as user_type,
COUNT(distinct o.user_id) as data FROM orders o
JOIN users u ON o.user_id = u.userid
WHERE DATE_FORMAT(o.date, "%Y") = "2021"
AND o.user_id NOT IN
(SELECT DISTINCT o1.user_id FROM orders o1 WHERE DATE_FORMAT(o1.date, "%b") = DATE_FORMAT(now(), "%b") AND YEAR(o1.date) = YEAR(now()) )
AND o.user_id IN
(SELECT DISTINCT o1.user_id FROM orders o1 WHERE (DATE_FORMAT(o1.date, "%c") < DATE_FORMAT(now(), "%c") OR YEAR(o1.date) < YEAR(now())))
GROUP BY DATE_FORMAT(o.date, "%Y %b"),
u.user_type HAVING SUM(o.total) > 0 ORDER BY o.date ASC
EDIT The query below returns every month of the year
SELECT months.MONTH as label,
ifnull(UPPER(u.user_type), '-') as user_type,
COUNT(distinct o.user_id) as data
FROM (
SELECT 1 AS MONTH
UNION SELECT 2 AS MONTH
UNION SELECT 3 AS MONTH
UNION SELECT 4 AS MONTH
UNION SELECT 5 AS MONTH
UNION SELECT 6 AS MONTH
UNION SELECT 7 AS MONTH
UNION SELECT 8 AS MONTH
UNION SELECT 9 AS MONTH
UNION SELECT 10 AS MONTH
UNION SELECT 11 AS MONTH
UNION SELECT 12 AS MONTH
) as months
LEFT JOIN orders o
ON DATE_FORMAT(o.date, "%c") = months.MONTH
LEFT JOIN users u ON o.user_id = u.userid
WHERE (DATE_FORMAT(o.date, "%Y") = "2021" OR o.date IS NULL)
AND (
(
NOT EXISTS
(SELECT DISTINCT o1.user_id
FROM orders o1
WHERE
DATE_FORMAT(o1.date, "%b") = DATE_FORMAT(now(), "%b")
AND YEAR(o1.date) = YEAR(now())
AND o1.user_id = o.user_id
)
AND EXISTS
(SELECT DISTINCT o1.user_id
FROM orders o1
WHERE
(DATE_FORMAT(o1.date, "%c") < DATE_FORMAT(now(), "%c") OR YEAR(o1.date) < YEAR(now())) AND o1.user_id = o.user_id
)
)
OR o.user_id IS null OR u.userid IS NULL
)
GROUP BY months.MONTH, u.user_type ORDER BY months.MONTH ASC
Upvotes: 0