Reputation: 163
I'd like to be able to implement a "capped" cumulative sum in BigQuery using SQL.
Here's what I mean: I have a table whose rows have the amount by which a value is increased/decreased each day, but the value cannot go below 0 or above 100. I want to compute the cumulative sum of the changes to keep track of this value.
As an example, consider the following table:
day | change
--------------
1 | 70
2 | 50
3 | 20
4 | -30
5 | 10
6 | -90
7 | 20
I want to make a column that has the capped cumulative sum so that it looks like this:
day | change | capped cumsum
----------------------------
1 | 70 | 70
2 | 50 | 100
3 | 20 | 100
4 | -30 | 70
5 | 10 | 80
6 | -90 | 0
7 | 20 | 20
Simply doing SUM (change) OVER (ORDER BY day)
and capping the values at 100 and 0 won't work. I need some sort of recursive loop and I don't know how to implement this in BigQuery.
Eventually I'd also like to do this over partitions, so that if I have something like
day | class | change
--------------
1 | A | 70
1 | B | 12
2 | A | 50
2 | B | 83
3 | A | -30
3 | B | 17
4 | A | 10
5 | A | -90
6 | A | 20
I can do the capped cumulative sum partitioned over each class.
Upvotes: 0
Views: 937
Reputation: 173191
Eventually I'd also like to do this over partitions ...
Consider below solution
create temp function cap_value(value int64, lower_boundary int64, upper_boundary int64) as (
least(greatest(value, lower_boundary), upper_boundary)
);
with recursive temp_table as (
select *, row_number() over(partition by class order by day) as n from your_table
), iterations as (
select 1 as n, day, class, change, cap_value(change, 0, 100) as capped_cumsum
from temp_table
where n = 1
union all
select t.n, t.day, t.class, t.change, cap_value(i.capped_cumsum + t.change, 0, 100) as capped_cumsum
from temp_table t
join iterations i
on t.n = i.n + 1
and t.class = i.class
)
select * except(n) from iterations
order by class, day
if applied to sample data in your question - output is
Upvotes: 1
Reputation: 12274
Using RECURSIVE CTE
can be another option:
DECLARE sample ARRAY<STRUCT<day INT64, change INT64>> DEFAULT [
(1, 70), (2, 50), (3, 20), (4, -30), (5, 10), (6, -90), (7, 20)
];
WITH RECURSIVE ccsum AS (
SELECT 0 AS n, vals[OFFSET(0)] AS change,
CASE
WHEN vals[OFFSET(0)] > 100 THEN 100
WHEN vals[OFFSET(0)] < 0 THEN 0
ELSE vals[OFFSET(0)]
END AS cap_csum
FROM sample
UNION ALL
SELECT n + 1 AS n, vals[OFFSET(n + 1)] AS change,
CASE
WHEN cap_csum + vals[OFFSET(n + 1)] > 100 THEN 100
WHEN cap_csum + vals[OFFSET(n + 1)] < 0 THEN 0
ELSE cap_csum + vals[OFFSET(n + 1)]
END AS cap_csum
FROM ccsum, sample
WHERE n < ARRAY_LENGTH(vals) - 1
),
sample AS (
SELECT ARRAY_AGG(change ORDER BY day) vals FROM UNNEST(sample)
)
SELECT * EXCEPT(n) FROM ccsum ORDER BY n;
output:
Upvotes: 0
Reputation: 173191
I need some sort of recursive loop and I don't know how to implement this in BigQuery
Super naïve / cursor based approach
declare cumulative_change int64 default 0;
create temp table temp_table as (
select * , 0 as capped_cumsum from your_table where false
);
for rec in (select * from your_table order by day)
do
set cumulative_change = cumulative_change + rec.change;
set cumulative_change = case when cumulative_change < 0 then 0 when cumulative_change > 100 then 100 else cumulative_change end;
insert into temp_table (select rec.*, cumulative_change);
end for;
select * from temp_table order by day;
if applied to sample data in your question - output is
Slightly modified option with use of array instead of temp table
declare cumulative_change int64 default 0;
declare result array<struct<day int64, change int64, capped_cumsum int64>>;
for rec in (select * from your_table order by day)
do
set cumulative_change = cumulative_change + rec.change;
set cumulative_change = case when cumulative_change < 0 then 0 when cumulative_change > 100 then 100 else cumulative_change end;
set result = array(select as struct * from unnest(result) union all select as struct rec.*, cumulative_change);
end for;
select * from unnest(result) order by day;
P.S. I like none of above options so far :o)
Meantime, that approach might work for relatively small tables, set of data
Upvotes: 1