Reputation: 23
I have a table like below:
timestamp | customer_id | product_name
2022-04-01 23:49:07 UTC | a23 | dog_shampoo
2022-04-01 23:49:07 UTC | a33 | dog_shampoo
2022-04-01 23:49:07 UTC | a45 | dog_toy
2022-04-01 23:49:09 UTC | a67 | dog_shampoo
2022-04-01 23:49:09 UTC | a66 | dog_toy
2022-04-01 23:49:09 UTC | a63 | dog_toy
2022-04-01 23:50:10 UTC | a50 | dog_shampoo
2022-04-01 23:50:11 UTC | a51 | dog_shampoo
If I want to get number of concurrent purchases for each product, it can be simply done by a group by like below:
SELECT
product_name,
timestamp,
COUNT(DISTINCT user_id) AS concurrent_purchases
FROM table
GROUP BY 1,2
Getting this output:
result1:
timestamp | product_name | concurrent_purchase
2022-04-01 23:49:07 UTC | dog_shampoo | 2
2022-04-01 23:49:07 UTC | dog_toy | 1
2022-04-01 23:49:09 UTC | dog_toy | 2
2022-04-01 23:49:09 UTC | dog_shampoo | 1
2022-04-01 23:50:10 UTC | dog_shampoo | 1
2022-04-01 23:50:11 UTC | dog_shampoo | 1
And then the maximum will be:
SELECT
product_name,
MAX(concurrent_purchases) AS max_concurrent_purchases
FROM result1
GROUP BY 1
from which the final result will be
product_name | max_concurrent_purchases
dog_shampoo | 2
dog_toy | 2
But if I define concurrent purchases as within 2 seconds of each other to build an output like below:
timestamp | product_name | concurrent_purchase
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC | dog_shampoo | 3
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC | dog_shampoo | 0
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC | dog_shampoo | 0
. | dog_shampoo | 0
. | dog_shampoo | 0
. | dog_shampoo | 0
2022-04-01 23:50:10 UTC - 2022-04-01 23:50:12 UTC | dog_shampoo | 2
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC | dog_toy | 3
That'll give this result:
product_name | max_concurrent_purchases
dog_shampoo | 3
dog_toy | 3
And I am interested in making a flexible solution where it is easy to change the interval values (2 seconds - 10 seconds) and run for different values.
I figured I can get the minimum timestamp for each product name and start incrementing by maybe a control flow statement
but I am not very familiar with setting variables and looping in SQL and wasn’t sure how slow a solution like this would be.
How can I build a query that gives me the desired output like below with different values for the interval?
interval (seconds) | product_name | concurrent_purchase
0 | dog_shampoo | 2
0 | dog_toy | 2
2 | dog_shampoo | 3
2 | dog_toy | 3
Upvotes: 0
Views: 138
Reputation: 173028
Consider below approach
select product_name, max(concurrent_purchase) as max_concurrent_purchases
from (
select *, count(*) over win as concurrent_purchase
from your_table
window win as (
partition by product_name
order by unix_seconds(timestamp)
range between current row and 2 following
)
)
group by product_name
if applied to sample data in your question - output is
If to use range between current row and 1 following
- the output is
Upvotes: 1
Reputation: 12254
Consider below:
You can adjust an interval via interval_value
variable.
DECLARE interval_value INT64 DEFAULT 2;
CREATE TEMP TABLE sample (
timestamp TIMESTAMP,
customer_id STRING,
product_name STRING
);
INSERT INTO sample VALUES
('2022-04-01 23:49:07 UTC', 'a23', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a33', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a45', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a67', 'dog_shampoo'),
('2022-04-01 23:49:09 UTC', 'a66', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a63', 'dog_toy'),
('2022-04-01 23:50:10 UTC', 'a50', 'dog_shampoo'),
('2022-04-01 23:50:11 UTC', 'a51', 'dog_shampoo')
;
WITH intervals AS (
SELECT evt_grp, product_name, COUNT(1) AS purchases FROM (
SELECT *, DIV(TIMESTAMP_DIFF(timestamp, MIN(timestamp) OVER (), SECOND), interval_value + 1) evt_grp
FROM sample
)
GROUP BY 1, 2
)
SELECT product_name, MAX(purchases) AS max_concurrent_purchasesax
FROM intervals
GROUP BY 1
;
If you set interval_value
as 1, output will be:
If interval_value
is 2, output will be:
Upvotes: 0
Reputation: 26
I made a query here to solve this but not using loops. What I explored was the GENERATE_TIMESTAMP_ARRAY function (https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_timestamp_array) and the TIMESTAMP_ADD function (https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp_add) making an array with the intervals and after exploding it joining with the original table.
Here is the code:
DECLARE window_size INT64;
SET window_size = 2;
WITH timestamp_array AS (
SELECT GENERATE_TIMESTAMP_ARRAY(min(timestamp), max(timestamp), INTERVAL window_size SECOND) as ts_array
FROM <your_table>
)
,products AS (
SELECT *
FROM UNNEST(['dog_shampoo', 'dog_toy']) AS product
)
,timestamp_ranges AS (
SELECT flattened_timestamp AS start_time, TIMESTAMP_ADD(flattened_timestamp, INTERVAL window_size SECOND) AS end_time, p.product
FROM timestamp_array
CROSS JOIN UNNEST(timestamp_array.ts_array) AS flattened_timestamp
CROSS JOIN products p
)
,products_sold AS (
SELECT
a.start_time
,a.end_time
,a.product
,b.timestamp
,b.customer_id
,IF (b.product_name IS NOT NULL, 1, 0) AS purchase_count
FROM timestamp_ranges a
LEFT JOIN <your_table> b
ON a.start_time <= b.timestamp
AND a.end_time > b.timestamp
AND a.product = b.product_name
)
SELECT
CONCAT(start_time, ' - ', end_time) AS time_range
,product
,sum(purchase_count) as concurrent_purchases
FROM products_sold
GROUP BY start_time, end_time, product
ORDER BY product, time_range
;
The only additional comment is that I considered a timestamp "in the range" when it's bigger or equal than the start time and less than the end time. With this rule, the results are not exactly the same as your example. But if you want to change this behaviour you just need to change the rule of the join here:
LEFT JOIN <your_table> b
ON a.start_time <= b.timestamp
AND a.end_time > b.timestamp
Upvotes: 0