Reputation: 289
Consider my dataframe, df
:
data data_binary sum_data
2 1 1
5 0 0
1 1 1
4 1 2
3 1 3
10 0 0
7 0 0
3 1 1
How can I calculate the cumulative sum of data_binary
within groups of contiguous 1
values?
The first group of 1
's had a single 1
and sum_data
has only a 1
. However, the second group of 1
's has 3 1
's and sum_data
is [1, 2, 3]
.
I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0)
, but that returns
array([1, 0, 2, 3, 4, 0, 0, 5])
Which is not what I want.
Upvotes: 13
Views: 9343
Reputation: 294218
You want to take the cumulative sum of data_binary
and subtract the most recent cumulative sum where data_binary
was zero.
b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
data_binary cumulative_sum nan_non_zero forward_fill final_result
0 1 1 NaN NaN 1
1 0 1 1.0 1.0 0
2 1 2 NaN 1.0 1
3 1 3 NaN 1.0 2
4 1 4 NaN 1.0 3
5 0 4 4.0 4.0 0
6 0 4 4.0 4.0 0
7 1 5 NaN 4.0 1
The problem with cumulative_sum
is that the rows where data_binary
is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary
is zero? Easy! I slice the cumulative sum where data_binary
is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.
Upvotes: 21
Reputation: 862511
I think you can groupby
with DataFrameGroupBy.cumsum
by Series
, where first compare the next value by the shift
ed column if not equal (!=
) and then create groups by cumsum
. Last, replace 0
by column data_binary
with mask
:
print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: data_binary, dtype: int32
df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
.cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
data data_binary sum_data sum_data1
0 2 1 1 1
1 5 0 0 0
2 1 1 1 1
3 4 1 2 2
4 3 1 3 3
5 10 0 0 0
6 7 0 0 0
7 3 1 1 1
Upvotes: 9
Reputation: 2700
If you want the excellent piRSquared's answer in just one single command:
df['sum_data'] = df[['data_binary']].apply(
lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
axis=0)
Note that the double squared bracket on the right hand side is necessary to make a one-column DataFrame instead of a Series in order to use apply
with the axis
argument (which is not available when apply
is used on Series).
Upvotes: 0