loopswhile-loopgoogle-bigquerycontrol-flow

Reputation: 23

Ho do I a SQL query in BigQuery to calculate maximum events within a certain time interval?

I have a table like below:

timestamp               | customer_id | product_name
2022-04-01 23:49:07 UTC |   a23       |  dog_shampoo
2022-04-01 23:49:07 UTC |   a33       |  dog_shampoo
2022-04-01 23:49:07 UTC |   a45       |  dog_toy
2022-04-01 23:49:09 UTC |   a67       |  dog_shampoo
2022-04-01 23:49:09 UTC |   a66       |  dog_toy
2022-04-01 23:49:09 UTC |   a63       |  dog_toy
2022-04-01 23:50:10 UTC |   a50       |  dog_shampoo
2022-04-01 23:50:11 UTC |   a51       |  dog_shampoo

If I want to get number of concurrent purchases for each product, it can be simply done by a group by like below:

SELECT
   product_name,
   timestamp,
   COUNT(DISTINCT user_id) AS concurrent_purchases
FROM table
GROUP BY 1,2

Getting this output:

result1:
timestamp               | product_name  | concurrent_purchase
2022-04-01 23:49:07 UTC | dog_shampoo   |     2
2022-04-01 23:49:07 UTC | dog_toy       |     1
2022-04-01 23:49:09 UTC | dog_toy       |     2
2022-04-01 23:49:09 UTC | dog_shampoo   |     1
2022-04-01 23:50:10 UTC | dog_shampoo   |     1
2022-04-01 23:50:11 UTC | dog_shampoo   |     1

And then the maximum will be:

SELECT
     product_name,
     MAX(concurrent_purchases) AS max_concurrent_purchases
FROM result1
GROUP BY 1

from which the final result will be

product_name   | max_concurrent_purchases
dog_shampoo    |     2
dog_toy        |     2

But if I define concurrent purchases as within 2 seconds of each other to build an output like below:

timestamp                                          | product_name  | concurrent_purchase
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC  | dog_shampoo   |     3
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC  | dog_shampoo   |     0
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC  | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
2022-04-01 23:50:10 UTC - 2022-04-01 23:50:12 UTC  | dog_shampoo   |     2
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC  | dog_toy       |     3

That'll give this result:

product_name   | max_concurrent_purchases
dog_shampoo    |     3
dog_toy        |     3

And I am interested in making a flexible solution where it is easy to change the interval values (2 seconds - 10 seconds) and run for different values.

I figured I can get the minimum timestamp for each product name and start incrementing by maybe a control flow statement but I am not very familiar with setting variables and looping in SQL and wasn’t sure how slow a solution like this would be.

How can I build a query that gives me the desired output like below with different values for the interval?

interval (seconds) | product_name   | concurrent_purchase
        0          | dog_shampoo    |     2
        0          | dog_toy        |     2
        2          | dog_shampoo    |     3
        2          | dog_toy        |     3

Upvotes: 0

Answers (3)

Mikhail Berlyant

Reputation: 173028

Consider below approach

select product_name, max(concurrent_purchase) as max_concurrent_purchases
from (
  select *, count(*) over win as concurrent_purchase 
  from your_table
  window win as (
    partition by product_name 
    order by unix_seconds(timestamp) 
    range between current row and 2 following
  )
)
group by product_name

if applied to sample data in your question - output is

If to use range between current row and 1 following - the output is

Upvotes: 1

Jaytiger

Reputation: 12254

Consider below:

You can adjust an interval via interval_value variable.

DECLARE interval_value INT64 DEFAULT 2; 

CREATE TEMP TABLE sample (
    timestamp TIMESTAMP,
    customer_id STRING,
    product_name STRING
);
INSERT INTO sample VALUES
('2022-04-01 23:49:07 UTC', 'a23', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a33', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a45', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a67', 'dog_shampoo'),
('2022-04-01 23:49:09 UTC', 'a66', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a63', 'dog_toy'),
('2022-04-01 23:50:10 UTC', 'a50', 'dog_shampoo'),
('2022-04-01 23:50:11 UTC', 'a51', 'dog_shampoo')
;

WITH intervals AS (
  SELECT evt_grp, product_name, COUNT(1) AS purchases FROM (
    SELECT *, DIV(TIMESTAMP_DIFF(timestamp,  MIN(timestamp) OVER (), SECOND), interval_value + 1) evt_grp
      FROM sample
  )
  GROUP BY 1, 2 
)
SELECT product_name, MAX(purchases) AS max_concurrent_purchasesax 
  FROM intervals
 GROUP BY 1
;

If you set interval_value as 1, output will be:

If interval_value is 2, output will be:

Upvotes: 0

improta

Reputation: 26

I made a query here to solve this but not using loops. What I explored was the GENERATE_TIMESTAMP_ARRAY function (https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_timestamp_array) and the TIMESTAMP_ADD function (https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp_add) making an array with the intervals and after exploding it joining with the original table.

Here is the code:

DECLARE window_size INT64;
SET window_size = 2;

WITH timestamp_array AS (
  SELECT GENERATE_TIMESTAMP_ARRAY(min(timestamp), max(timestamp), INTERVAL window_size SECOND) as ts_array
  FROM <your_table>
)
,products AS (
  SELECT *
  FROM UNNEST(['dog_shampoo', 'dog_toy']) AS product
)
,timestamp_ranges AS (
  SELECT flattened_timestamp AS start_time, TIMESTAMP_ADD(flattened_timestamp, INTERVAL window_size SECOND) AS end_time, p.product
  FROM timestamp_array
  CROSS JOIN UNNEST(timestamp_array.ts_array) AS flattened_timestamp
  CROSS JOIN products p
)
,products_sold AS (
  SELECT
    a.start_time
    ,a.end_time
    ,a.product
    ,b.timestamp
    ,b.customer_id
    ,IF (b.product_name IS NOT NULL, 1, 0) AS purchase_count
  FROM timestamp_ranges a
  LEFT JOIN <your_table> b
    ON a.start_time <= b.timestamp
    AND a.end_time > b.timestamp
    AND a.product = b.product_name
)
SELECT
  CONCAT(start_time, ' - ', end_time) AS time_range
  ,product
  ,sum(purchase_count) as concurrent_purchases
FROM products_sold
GROUP BY start_time, end_time, product
ORDER BY product, time_range
;

The only additional comment is that I considered a timestamp "in the range" when it's bigger or equal than the start time and less than the end time. With this rule, the results are not exactly the same as your example. But if you want to change this behaviour you just need to change the rule of the join here:

  LEFT JOIN <your_table> b
    ON a.start_time <= b.timestamp
    AND a.end_time > b.timestamp

Upvotes: 0

Ho do I a SQL query in BigQuery to calculate maximum events within a certain time interval?

Answers (3)

Related Questions