Reputation: 35
The following is the table with distinct CustomerID and Trunc_Date(Date,MONTH) called Date.
DATE | CustomerID |
---|---|
2021-01-01 | 111 |
2021-01-01 | 112 |
2021-02-01 | 111 |
2021-03-01 | 113 |
2021-03-01 | 115 |
2021-04-01 | 119 |
For a given month M, I want to get the count of distinct CustomerIDs of the three previous months combined. Eg. for the month of July (7), I want to get the distinct count of CustomerIDs from the month of April (4), May (5) and until June (6). I do not want the customer in July (7) to be included for the record for July.
So the output will be like:
DATE | CustomerID Count |
---|---|
2021-01-01 | 535 |
2021-02-01 | 657 |
2021-03-01 | 777 |
2021-04-01 | 436 |
2021-05-01 | 879 |
2021-06-01 | 691 |
Upvotes: 1
Views: 661
Reputation: 173151
BigQuery ran out of memory running this since we have lots of data
In cases like this - the most scalable and performant approach is to use HyperLogLog++ functions as in example below
select distinct date,
( select hll_count.merge(sketch)
from t.prev_3_month_sketches sketch
) customerid_count
from (
select *,
array_agg(customers_hhl) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_sketches,
from (
select date_diff(date, '2000-01-01', month) pos,
min(date) date,
hll_count.init(customerid) customers_hhl
from `project.dataset.table`
group by pos
)
) t
If applied to sample data in your question - output is
Note: HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
Upvotes: 1
Reputation: 1270733
You can also solve this problem by creating a record for each month in the three following months and then aggregating:
select date_add(date, interval n month) as month,
count(distinct customerid)
from t cross join
unnest(generate_array(1, 3, 1)) n
group by month;
Upvotes: 1
Reputation: 173151
Consider below
select distinct date,
( select count(distinct id)
from t.prev_3_month_customers id
) customerid_count
from (
select *,
array_agg(customerid) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_customers,
from (
select *, date_diff(date, '2000-01-01', month) pos
from `project.dataset.table`
)
) t
If applied to sample data in your question - output is
Upvotes: 2