Reputation: 1267
I am wondering how best to do both a tumbling and sliding window in BigQuery standardSQL (not legacy).
Apache Spark makes this pretty easy - for a 'sliding' window of length 10 mins sliding every 5 mins, groupBy(window("10 minutes","5 minutes)) and for a tumbling window groupBy(window("10 minutes")).
Consider I have a simplified order row with:
orderId,
orderPlacedTimestamp,
orderTotals.grandTotalNet,
orderTotals.grandTotalGross
(note the struct on the totals)
and I need two total aggregate order sales value as:
I started this in using a SQL over
but got a bit messy with partitioning, timestamps and getting the right start and end windows. For each the result set should be:
windowStartTime,
windowEndTime,
windowTotalAmount
So the windows have a starting and ending time (which is crucial for sliding given the length of the window 10 is different to the sliding duration 5 so a given row can be in multiple windows as a result).
How do I do this in BigQuery?
UPDATE 11/06 WITH CURRENT SQL:
SELECT
TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)) as startWindowTime,
TIMESTAMP_SECONDS(LAST_VALUE(ts_5min*5*60) OVER(w)) as endWindowTime,
SUM(orderTotalNetConverted) OVER(ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) as windowSalesTotal
FROM (
SELECT
CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
AS ts_5min,
orderTotalNetConverted
FROM orders
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY startWindowTime desc
Upvotes: 1
Views: 1463
Reputation: 14014
Tumbling window is easier to do, it is just a regular GROUP BY
at hourly intervals:
SELECT
TIMESTAMP_TRUNC(orderPlacedTimestamp, HOUR),
SUM(orderTotals.grandTotalNet)
FROM T
GROUP BY 1
For sliding window, I would first normalize timestamps to 5 minute intervals using following:
TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)
to get to minute boundaryUNIX_SECONDS
to convert to seconds since epochCAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
Now you can use standard OVER()
clause to get 10 minute window, which means 2 such intervals at a time, and in order to get start time, use FIRST_VALUE
analytic function:
SELECT
orderId,
TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)) startWindowTime,
TIMESTAMP_ADD(TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)),
INTERVAL 10 MINUTE) endWindowTime,
SUM(grandTotalNet) OVER(w)
FROM (
SELECT
*,
CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
AS ts_5min
FROM t
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
Upvotes: 2
Reputation: 173210
Below is for sliding 10 MINUTE / 5 MINUTE window
#standardSQL
WITH starts AS (
SELECT TIMESTAMP_ADD(TIMESTAMP_TRUNC(first, HOUR), INTERVAL step MINUTE) AS start
FROM
(SELECT MIN(orderPlacedTimestamp) AS first, MAX(orderPlacedTimestamp) AS last FROM YourTable),
UNNEST(GENERATE_ARRAY(0, TIMESTAMP_DIFF(last, TIMESTAMP_TRUNC(first, HOUR), MINUTE) , 5)) AS step
)
SELECT
start,
SUM(orderTotals.grandTotalNet) AS net,
SUM(orderTotals.grandTotalGross) AS gross
FROM starts AS s JOIN YourTable AS t
ON t.orderPlacedTimestamp BETWEEN s.start AND TIMESTAMP_ADD(start, INTERVAL 10 MINUTE)
GROUP BY start
ORDER BY start
As Mosha mentioned in his answer - tumbling HOURLY window is easy - but if you will need different window side - above approach is more flexible (I think) and easy to adjust for any window size, type by changing respective 5
and 10
in query code
But overall - wanted to mention - using analytic function is more optimal than joining
Upvotes: 1