Pablo Quiroz
Pablo Quiroz

Reputation: 35

Moving Distinct Count in Big Query (SQL syntax)

I have sort of a weird one for all the SQL masters out there. I need to get the distinct count of items in a moving window of 14 days. I tried dense_rank but it didn't specify (or I did not know how to) specify the 14 day moving window.

For simplicity my dataset has 3 columns.

  1. store (string)
  2. item code (string)
  3. date (date)

A quick example of my endo goal would be the following:

So then for day 1 my uniques would be 4, Day 2 my uniques would be 5 and day 3 my uniques would be 6 (1,2,3,4,5,6)

Once I get to day 15 I would ignore values found in day 1 and only take days 2-15

Any and all help would be greatly appreciated.

Upvotes: 0

Views: 359

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Another option to consider - with use of HyperLogLog++ functions - so it consumes less resources and faster

select store, date, 
  ( select hll_count.merge(sketch)
    from t.sketches_14days sketch 
  ) distinct_items_count
from (
  select store, date, 
    array_agg(daily_sketch) over(partition by store order by unix_date(date) range between 13 preceding and current row) sketches_14days
  from (
    select store, date, hll_count.init(item_code) daily_sketch
    from your_table
    group by store, date
  )
) t      

Note:

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Consider below approach

select store, date, 
  ( select count(distinct item) 
    from t.items item
  ) distinct_items_count
from (
  select store, date, any_value(items) items
  from (
    select store, date, 
      array_agg(item_code) over(partition by store order by unix_date(date) range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t

Upvotes: 2

Related Questions