Reputation: 23
I need to count the number of products that existed in inventory by date. In the database however, a product is only recorded when it was viewed by a consumer.
For example consider this basic table structure:
date | productId | views
July 1 | A | 8
July 2 | A | 6
July 2 | B | 4
July 3 | A | 2
July 4 | A | 8
July 4 | B | 6
July 4 | C | 4
July 5 | C | 2
July 10 | A | 17
Using the following query, I attempt to determine the amount of products in inventory on a given date.
select date, count(distinct productId) as Inventory, sum(views) as views
from (
select date, productId, count(*) as views
from SomeTable
group by date, productID
order by date asc, productID asc
)
group by date
This is the output
date | Inventory | views
July 1 | 1 | 8
July 2 | 2 | 10
July 3 | 1 | 2
July 4 | 3 | 18
July 5 | 1 | 2
July 10 | 1 | 17
My output is not an accurate reflection of how many products were in inventory due to missing rows.
The correct understanding of inventory is as follows:
- Product A was present in inventory from July 1 - July 10.
- Product B was present in inventory from July 2 - July 4.
- Product C was in inventory from July 4 - July 5.
The correct SQL output should be:
date | Inventory | views
July 1 | 1 | 8
July 2 | 2 | 10
July 3 | 2 | 2
July 4 | 3 | 18
July 5 | 2 | 2
July 6 | 1 | 0
July 7 | 1 | 0
July 8 | 1 | 0
July 9 | 1 | 0
July 10 | 1 | 17
If you are following along, let me confirm that I am comfortable defining "in inventory" as the date difference between the first & last view.
I have followed the following faulty process:
First I created a table which was the cartesian product of every productID & every date. ''' with Dates as ( select date from SomeTable group by date ), Products as ( select productId from SomeTable group by productId ) select Dates.date, Products.productId from Dates cross join Products '''
Then I attempted do a right outer join to reduce this to just the missing records:
with Records as (
select date, productId, count(*) as views
from SomeTable
group by date, productId
),
Cartesian as (
{See query above}
)
Select Cartesian.date, Cartesian.productId, 0 as views #for upcoming union
from Cartesian right outer join Records
on Cartesian.date = Records.date
where Records.productId is null
Then with the missing rows in hand, union them back onto the Records. in doing so, I create a new problem: extra rows.
date | productId | views
July 1 | A | 8
July 1 | B | 0
July 1 | C | 0
July 2 | A | 6
July 2 | B | 4
July 2 | C | 0
July 3 | A | 2
July 3 | B | 0
July 3 | C | 0
July 4 | A | 8
July 4 | B | 6
July 4 | C | 4
July 5 | A | 2
July 5 | B | 0
July 5 | C | 0
July 6 | A | 0
July 6 | B | 0
July 6 | C | 0
July 7 | A | 0
July 7 | B | 0
July 7 | C | 0
July 8 | A | 0
July 8 | B | 0
July 8 | C | 0
July 9 | A | 0
July 9 | B | 0
July 9 | C | 0
July 10 | A | 17
July 10 | B | 0
July 10 | C | 0
And when I run my simple query
select date, count(distinct productId) as Inventory, sum(views) as views
on that table I get the wrong output again:
date | Inventory | views
July 1 | 3 | 8
July 2 | 3 | 10
July 3 | 3 | 2
July 4 | 3 | 18
July 5 | 3 | 2
July 6 | 3 | 0
July 7 | 3 | 0
July 8 | 3 | 0
July 9 | 3 | 0
July 10 | 3 | 17
My next thought would be to iterate through each productId, determine it's first & last date, then Union that with the Cartesian table with the condition that the Cartesian.date falls between the first & last date for each specific product.
There's got to be an easier way to do this. Thanks.
Upvotes: 2
Views: 109
Reputation: 173190
Below is for BigQuery Standard SQL
#standardSQL
WITH dates AS (
SELECT day FROM (
SELECT MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
SELECT productId, MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table` t
GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day
GROUP BY day
If to apply to sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2019-07-01' day, 'A' productId, 8 views UNION ALL
SELECT '2019-07-02', 'A', 6 UNION ALL
SELECT '2019-07-02', 'B', 4 UNION ALL
SELECT '2019-07-03', 'A', 2 UNION ALL
SELECT '2019-07-04', 'A', 8 UNION ALL
SELECT '2019-07-04', 'B', 6 UNION ALL
SELECT '2019-07-04', 'C', 4 UNION ALL
SELECT '2019-07-05', 'C', 2 UNION ALL
SELECT '2019-07-10', 'A', 17
), dates AS (
SELECT day FROM (
SELECT MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
SELECT productId, MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table` t
GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day
GROUP BY day
-- ORDER BY day
result is
Row day Inventory views
1 2019-07-01 1 8
2 2019-07-02 2 10
3 2019-07-03 2 2
4 2019-07-04 3 18
5 2019-07-05 2 2
6 2019-07-06 1 0
7 2019-07-07 1 0
8 2019-07-08 1 0
9 2019-07-09 1 0
10 2019-07-10 1 17
Upvotes: 1