Reputation: 193
I'm trying to use Redshift SUM() and window functions to perform a cumulative sum. My data looks like this:
ID | item_date | item_count |
---|---|---|
12 | 01/01/2019 | 11 |
12 | 02/01/2019 | 8 |
12 | 03/01/2019 | 0 |
12 | 04/01/2019 | 5 |
12 | 05/01/2019 | 21 |
12 | 06/01/2019 | 0 |
Currently, my summation looks like this:
SUM(item_count) over (partition by ID order by item_date rows unbounded preceding) as cumulative_count
But it produces this output:
ID | item_date | item_count | cumulative_count |
---|---|---|---|
12 | 01/01/2019 | 11 | 11 |
12 | 02/01/2019 | 8 | 19 |
12 | 03/01/2019 | 0 | 0 |
12 | 04/01/2019 | 5 | 24 |
12 | 05/01/2019 | 21 | 45 |
12 | 06/01/2019 | 0 | 0 |
The behavior is correct EXCEPT when item_count = 0. Obviously my desired output would be:
ID | item_date | item_count | cumulative_count |
---|---|---|---|
12 | 01/01/2019 | 11 | 11 |
12 | 02/01/2019 | 8 | 19 |
12 | 03/01/2019 | 0 | 19 |
12 | 04/01/2019 | 5 | 24 |
12 | 05/01/2019 | 21 | 45 |
12 | 06/01/2019 | 0 | 45 |
I've looked into using the LAST_VALUE()
function as a way to backfill the zero-values but in redshift you can't nest window functions.
Has anyone seen this before?
Upvotes: 0
Views: 2209
Reputation: 11032
Redshift is a tried and true database with years under its belt so for there to be a bug in basic functionality seems unlikely but should be checked out. I through together this test case SQL and ran it on my cluster and it produced the expected results.
create table test (ID int, item_date date, item_count int);
insert into test values
(12, '01/01/2019', 11),
(12, '02/01/2019', 8),
(12, '03/01/2019', 0),
(12, '04/01/2019', 5),
(12, '05/01/2019', 21),
(12, '06/01/2019', 0);
select *, SUM(item_count) over (partition by ID order by item_date rows unbounded preceding) as cumulative_count
from test;
and it produced:
id | item_date | item_count | cumulative_count
---+------------+------------+-----------------
12 | 2019-01-01 | 11 | 11
12 | 2019-02-01 | 8 | 19
12 | 2019-03-01 | 0 | 19
12 | 2019-04-01 | 5 | 24
12 | 2019-05-01 | 21 | 45
12 | 2019-06-01 | 0 | 45
My cluster's version is Redshift 1.0.34272
Does this test code produce the correct answer on your cluster? If it does then there is something subtle going on with your query/data/situation. If not then I'd package it up and submit a support ticket.
====================================================
Pondering this and I had a thought on how this could have happened. If you IDs are text and there are non-printing chars in them then they are seen as a different partition. For example:
drop table if exists test;
create table test (ID varchar(8), item_date date, item_count int);
insert into test values
('12', '01/01/2019', 11),
('12', '02/01/2019', 8),
('12 ', '03/01/2019', 0),
('12', '04/01/2019', 5),
('12', '05/01/2019', 21),
('12 ', '06/01/2019', 0);
select *, SUM(item_count) over (partition by ID order by item_date rows unbounded preceding) as cumulative_count
from test
order by item_date;
Now this is just one way this could be happening. I'm sure there are others.
Upvotes: 2