Reputation: 21
I have the following column with True
and False
boolean values. I want create a new column performing a cumulative sum on the True
values and if the value is False
reset the count, like this:
bool count
0 False 0
1 True 1
2 True 2
3 True 3
4 False 0
5 True 1
6 True 2
7 False 0
8 False 0
Upvotes: 2
Views: 922
Reputation: 863166
If performance is important use this vectorized solution:
cumsum = df['bool'].cumsum()
df['count1'] = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int))
print (df)
bool count count1
0 False 0 0
1 True 1 1
2 True 2 2
3 True 3 3
4 False 0 0
5 True 1 1
6 True 2 2
7 False 0 0
8 False 0 0
Explanation:
First use cumulative sum in boolean column, True
s are processing like 1
to variable cumsum
(in column cumsum
).
Then replace if False
s by invert mask by ~
in Series.where
(in column add_NaNs
), so possible forward filling missing values by previous values by ffill
(in column forward_fill_NaNs
).
It is possible use for subtracting by original Series cumsum
(in column subtract
).
Last if some NaN
s in start of column (not replaced by ffill
) use Series.fillna
and convert output to integers
(in column out
).
cumsum = df['bool'].cumsum()
print (df.assign(cumsum = cumsum,
add_NaNs = cumsum.where(~df['bool']),
forward_fill_NaNs= cumsum.where(~df['bool']).ffill(),
subtract = cumsum.sub(cumsum.where(~df['bool']).ffill()),
out = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int)))
)
bool count cumsum add_NaNs forward_fill_NaNs subtract out
0 False 0 0 0.0 0.0 0.0 0
1 True 1 1 NaN 0.0 1.0 1
2 True 2 2 NaN 0.0 2.0 2
3 True 3 3 NaN 0.0 3.0 3
4 False 0 3 3.0 3.0 0.0 0
5 True 1 4 NaN 3.0 1.0 1
6 True 2 5 NaN 3.0 2.0 2
7 False 0 5 5.0 5.0 0.0 0
8 False 0 5 5.0 5.0 0.0 0
Performance (in sample data, best test in real):
#9k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [17]: %%timeit
...: current_count = 0
...: for index, row in df.iterrows():
...: if (row['bool']):
...: current_count += 1
...: else:
...: current_count = 0
...: df.at[index, 'count1'] = current_count
...:
1.06 s ± 5.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %%timeit
...: df['count1'] = df.groupby(df['bool'].astype(int).diff().ne(0).cumsum())['bool'].cumsum()
...:
...:
2.91 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [19]: %%timeit
...: cumsum = df['bool'].cumsum()
...: df['count1'] = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int))
...:
1.38 ms ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#900k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [21]: %%timeit
...: df['count1'] = df.groupby(df['bool'].astype(int).diff().ne(0).cumsum())['bool'].cumsum()
...:
...:
105 ms ± 971 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [22]: %%timeit
...: cumsum = df['bool'].cumsum()
...: df['count1'] = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int))
...:
...:
42.5 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 0
Reputation: 605
There might be a more Pythonic way to do it, but here is a simple iterative approach:
current_count = 0
for index, row in data.iterrows():
if (row['bool']):
current_count += 1
else:
current_count = 0
data.at[index, 'count'] = current_count
Here, data
is the dataframe that initially has one column named bool
, then we add the count
column!
Upvotes: 1
Reputation:
Yes, this can be done, using a series of steps:
df['count'] = df.groupby(df['bool'].astype(int).diff().ne(0).cumsum())['bool'].cumsum()
Output:
>>> df
bool count
0 False 0
1 True 1
2 True 2
3 True 3
4 False 0
5 True 1
6 True 2
7 False 0
8 False 0
Explanation:
This code creates separate groups for all consecutive true values (1's) coming before a false value (0), then, treating the trues as 1's and the falses as 0's, computes the cumulative sum for each group, then concatenates the results together.
df.groupby
-
df['bool'].astype(int)
- Takes each value of bool
, converts it to an int (true -> 1, false -> 0),.diff()
- For each integer value, computes the difference between it an the previous value (so if the prev val was False and this is True, 1 (1 - 0
); if prev was True and this True, 0 (1 - 1
); etc.).ne(0)
- Converts all values that are not equal to 0 to true, and zeros to false (because (0 != 0) == False
).cumsum()
- Calculates cumulative sum for true (1) values. This way, all the trues before any false (0) get their own unique number, which is returned to the groupby()
call, thus grouping separately each group of trues before a false['bool'].cumsum()
- From each group of consecutive true values (1), get the cumulative sum those 1s.Upvotes: 1