The Traveling Coder
The Traveling Coder

Reputation: 301

BigQuery: how to perform rolling timestamp window group count that produces row for each day

this is an extension to a question that I asked and resolved on StackOverflow here.

I'm a BigQuery and SQL novice and I wanted to construct a Standard SQL query that would group and count events over a rolling time window of X days. My data table looks like this:

event_id |    url    |          timestamp   
-----------------------------------------------------------
xx         a.html      2016-10-18 15:55:16 UTC
xx         a.html      2016-10-19 16:68:55 UTC
xx         a.html      2016-10-25 20:55:57 UTC
yy         b.html      2016-10-18 15:58:09 UTC
yy         a.html      2016-10-18 08:32:43 UTC
zz         a.html      2016-10-20 04:44:22 UTC
zz         c.html      2016-10-21 02:12:34 UTC

I'm tracking events that occur on urls. I want to know how many times each event occurred on each url during a rolling time period of X days. When I asked this question, I got a great answer:

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT 
  url, event_id, day, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM dailyAggregations

where 259200 is 3 days in seconds (3x24x3600). As I understand it, this query creates an intermediate table that groups and counts events by day. It also converts the timestamp field into its unix second equivalent. Then it sums up the events using a window that is measured in seconds.

Now this will produce a table with correct running totals, but it does not guarantee a row for every date, url, and event. In other words, there will be dates missing from the resultant table if there were dates when a given event never occurred on a given url. Bottom line, can I modify the above query (or construct a different query) that will correctly produce values for rolling4daysEvents for each date in an interval? eg: like an interval defined as:

SELECT *
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
  ORDER BY day ASC

Thanks!

Upvotes: 3

Views: 4118

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
)
SELECT 
  c.day, url, event_id, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM calendar AS c
LEFT JOIN dailyAggregations AS a
ON a.day = c.day

Upvotes: 2

Related Questions