user16438416
user16438416

Reputation: 35

SQL BigQuery: For the current month, count number of distinct CustomerIDs in the previous 3 month

The following is the table with distinct CustomerID and Trunc_Date(Date,MONTH) called Date.

DATE CustomerID
2021-01-01 111
2021-01-01 112
2021-02-01 111
2021-03-01 113
2021-03-01 115
2021-04-01 119

For a given month M, I want to get the count of distinct CustomerIDs of the three previous months combined. Eg. for the month of July (7), I want to get the distinct count of CustomerIDs from the month of April (4), May (5) and until June (6). I do not want the customer in July (7) to be included for the record for July.

So the output will be like:

DATE CustomerID Count
2021-01-01 535
2021-02-01 657
2021-03-01 777
2021-04-01 436
2021-05-01 879
2021-06-01 691

Upvotes: 1

Views: 661

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173151

BigQuery ran out of memory running this since we have lots of data

In cases like this - the most scalable and performant approach is to use HyperLogLog++ functions as in example below

select distinct date, 
  ( select hll_count.merge(sketch)
    from t.prev_3_month_sketches sketch
  ) customerid_count
from (
  select *, 
    array_agg(customers_hhl) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_sketches,
  from (
    select date_diff(date, '2000-01-01', month) pos, 
      min(date) date,
      hll_count.init(customerid) customers_hhl
    from `project.dataset.table`
    group by pos
  )
) t   

If applied to sample data in your question - output is

enter image description here

Note: HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270733

You can also solve this problem by creating a record for each month in the three following months and then aggregating:

select date_add(date, interval n month) as month,
       count(distinct customerid)
from t cross join
     unnest(generate_array(1, 3, 1)) n
group by month;

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173151

Consider below

select distinct date, 
  ( select count(distinct id)
    from t.prev_3_month_customers id
  ) customerid_count
from (
  select *, 
    array_agg(customerid) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_customers,
  from (
    select *, date_diff(date, '2000-01-01', month) pos
    from `project.dataset.table`
  )
) t    

If applied to sample data in your question - output is

enter image description here

Upvotes: 2

Related Questions