mgilson
mgilson

Reputation: 309821

Window functions with missing data

Assume that I have a table (MyTable) as follows:

item_id  |  date
----------------
     1   | 2016-06-08
     1   | 2016-06-07
     1   | 2016-06-05
     1   | 2016-06-04
     1   | 2016-05-31
     ...
     2   | 2016-06-08
     2   | 2016-06-06
     2   | 2016-06-04
     2   | 2016-05-31
     ...
     3   | 2016-05-31
     ...

I would like to build a weekly summary table that reports on a running 7 day window. The window would basically say "How many unique item_ids were reported in the preceding 7 days"?

So, in this case, the output table would look something like:

date      | weekly_ids
----------------------
2016-05-31|  3       # All 3 were present on the 31st
2016-06-01|  3       # All 3 were present on the 31st which is < 7 days before the 1st
2016-06-02|  3       # Same
2016-06-03|  3       # Same
2016-06-04|  3       # Same
2016-06-05|  3       # Same
2016-06-06|  3       # Same
2016-06-07|  3       # Same
2016-06-08|  2       # item 3 was not present for the entire last week so it does not add to the count.

I've tried:

SELECT
    item_id,
    date,
    MAX(present) OVER (
        PARTITION BY item_id
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS is_present
FROM (
    # Inner query
    SELECT
        item_id,
        date,
        1 AS present,
    FROM MyTable
)
GROUP BY date
ORDER BY date DESC

This feels like it is going in the right direction. But as it is, the window runs over the wrong time-frame when dates aren't present (too many dates) and it also doesn't output records for dates when the item_id wasn't present (even if it was present on the previous date). Is there a simple resolution to this problem?

If it's helpful and necessary

Upvotes: 3

Views: 1129

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Try below example. It can give you direction to explore
Purely GBQ - Legacy SQL

SELECT date, items FROM (
  SELECT 
    date, COUNT(DISTINCT item_id) OVER(ORDER BY sec RANGE BETWEEN 60*60*24*2 PRECEDING AND CURRENT ROW) AS items
  FROM (
    SELECT 
      item_id, date, timestamp_to_sec(timestamp(date)) AS sec
    FROM (

      SELECT calendar.day AS date, MyTable.item_id AS item_id
      FROM (
        SELECT DATE(DATE_ADD(TIMESTAMP('2016-05-28'), pos - 1, "DAY")) AS day
        FROM (
             SELECT ROW_NUMBER() OVER() AS pos, *
             FROM (FLATTEN((
             SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP('2016-05-28')), '.'),'') AS h
             FROM (SELECT NULL)),h
        )))
      ) AS calendar
      LEFT JOIN (
        SELECT date, item_id
        FROM
          (SELECT 1 AS item_id, '2016-06-08' AS date),
          (SELECT 1 AS item_id, '2016-06-07' AS date),
          (SELECT 1 AS item_id, '2016-06-05' AS date),
          (SELECT 1 AS item_id, '2016-06-04' AS date),
          (SELECT 1 AS item_id, '2016-05-28' AS date),
          (SELECT 2 AS item_id, '2016-06-08' AS date),
          (SELECT 2 AS item_id, '2016-06-06' AS date),
          (SELECT 2 AS item_id, '2016-06-04' AS date),
          (SELECT 2 AS item_id, '2016-05-31' AS date),
          (SELECT 3 AS item_id, '2016-05-31' AS date),
          (SELECT 3 AS item_id, '2016-06-05' AS date)
      ) AS MyTable
      ON calendar.day = MyTable.date
    )
  )
)
GROUP BY date, items
ORDER BY date  

Please note

  • oldest date - 2016-05-28 - is hardcoded in calendar subquery
  • window size is controled in RANGE BETWEEN 60*60*24*2 PRECEDING AND CURRENT ROW; if you need 7 days - the expression should be 60*60*24*6
  • have in mind specifics of COUNT(DISTINCT) in BigQuery Legacy SQL

Upvotes: 1

Ross Bush
Ross Bush

Reputation: 15155

I have created a temp table to hold dates, however, you probably would benefit from adding a permanent table to your database for these joins. Trust me it will cause less headaches.

DECLARE @my_table TABLE
(
    item_id int,
    date DATETIME
)
INSERT @my_table SELECT 1,'2016-06-08'
INSERT @my_table SELECT 1,'2016-06-07'
INSERT @my_table SELECT 1,'2016-06-05'
INSERT @my_table SELECT 1,'2016-06-04'
INSERT @my_table SELECT 1,'2016-05-31'
INSERT @my_table SELECT 2,'2016-06-08'
INSERT @my_table SELECT 2,'2016-06-06'
INSERT @my_table SELECT 2,'2016-06-04'
INSERT @my_table SELECT 2,'2016-05-31'
INSERT @my_table SELECT 3,'2016-05-31'

DECLARE @TrailingDays INT=7
DECLARE @LowDate DATETIME='01/01/2016'
DECLARE @HighDate DATETIME='12/31/2016'

DECLARE @Calendar TABLE(CalendarDate DATETIME)
DECLARE @LoopDate DATETIME=@LowDate

WHILE(@LoopDate<=@HighDate) BEGIN
    INSERT @Calendar SELECT @LoopDate
    SET @LoopDate=DATEADD(DAY,1,@LoopDate)
END 



SELECT
    date=HighDate,      
    weekly_ids=COUNT(DISTINCT item_id)
FROM
(
    SELECT
        HighDate=C.CalendarDate,
        LowDate=LAG(C.CalendarDate, @TrailingDays,0) OVER (ORDER BY C.CalendarDate)
    FROM
        @Calendar C 
    WHERE
        CalendarDate BETWEEN @LowDate AND @HighDate
)AS X
LEFT OUTER JOIN @my_table MT ON MT.date BETWEEN LowDate  AND HighDate
GROUP BY
    LowDate,
    HighDate

Upvotes: 2

Related Questions