Crcordova
Crcordova

Reputation: 21

Count cumulative true Value

I have the following column with True and False boolean values. I want create a new column performing a cumulative sum on the True values and if the value is False reset the count, like this:

   bool    count
0   False  0 
1   True   1   
2   True   2
3   True   3
4   False  0
5   True   1   
6   True   2   
7   False  0   
8   False  0

Upvotes: 2

Views: 922

Answers (3)

jezrael
jezrael

Reputation: 863166

If performance is important use this vectorized solution:

cumsum = df['bool'].cumsum()
df['count1'] = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int))
print (df)
    bool  count  count1
0  False      0       0
1   True      1       1
2   True      2       2
3   True      3       3
4  False      0       0
5   True      1       1
6   True      2       2
7  False      0       0
8  False      0       0

Explanation:

First use cumulative sum in boolean column, Trues are processing like 1 to variable cumsum (in column cumsum).

Then replace if Falses by invert mask by ~ in Series.where (in column add_NaNs), so possible forward filling missing values by previous values by ffill (in column forward_fill_NaNs).

It is possible use for subtracting by original Series cumsum (in column subtract).

Last if some NaNs in start of column (not replaced by ffill) use Series.fillna and convert output to integers (in column out).

cumsum = df['bool'].cumsum()

print (df.assign(cumsum = cumsum,
                 add_NaNs = cumsum.where(~df['bool']),
                 forward_fill_NaNs= cumsum.where(~df['bool']).ffill(),
                 subtract = cumsum.sub(cumsum.where(~df['bool']).ffill()),
                 out = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int)))
       )
    bool  count  cumsum  add_NaNs  forward_fill_NaNs  subtract  out
0  False      0       0       0.0                0.0       0.0    0
1   True      1       1       NaN                0.0       1.0    1
2   True      2       2       NaN                0.0       2.0    2
3   True      3       3       NaN                0.0       3.0    3
4  False      0       3       3.0                3.0       0.0    0
5   True      1       4       NaN                3.0       1.0    1
6   True      2       5       NaN                3.0       2.0    2
7  False      0       5       5.0                5.0       0.0    0
8  False      0       5       5.0                5.0       0.0    0

Performance (in sample data, best test in real):

#9k rows    
df = pd.concat([df] * 1000, ignore_index=True)

In [17]: %%timeit
    ...: current_count = 0
    ...: for index, row in df.iterrows():
    ...:     if (row['bool']):
    ...:         current_count += 1
    ...:     else:
    ...:         current_count = 0
    ...:     df.at[index, 'count1'] = current_count
    ...:     
1.06 s ± 5.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %%timeit 
    ...: df['count1'] = df.groupby(df['bool'].astype(int).diff().ne(0).cumsum())['bool'].cumsum()
    ...: 
    ...: 
2.91 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %%timeit
    ...: cumsum = df['bool'].cumsum()
    ...: df['count1'] = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int))
    ...: 
1.38 ms ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#900k rows    
df = pd.concat([df] * 100000, ignore_index=True)


In [21]: %%timeit 
    ...: df['count1'] = df.groupby(df['bool'].astype(int).diff().ne(0).cumsum())['bool'].cumsum()
    ...: 
    ...: 
105 ms ± 971 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [22]: %%timeit
    ...: cumsum = df['bool'].cumsum()
    ...: df['count1'] = cumsum.sub(cumsum.where(~df['bool']).ffill().fillna(0).astype(int))
    ...: 
    ...: 
42.5 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 0

Sadman Sakib
Sadman Sakib

Reputation: 605

There might be a more Pythonic way to do it, but here is a simple iterative approach:

current_count = 0
for index, row in data.iterrows():
    if (row['bool']):
        current_count += 1
    else:
        current_count = 0
    data.at[index, 'count'] = current_count

Here, data is the dataframe that initially has one column named bool, then we add the count column!

Upvotes: 1

user17242583
user17242583

Reputation:

Yes, this can be done, using a series of steps:

df['count'] = df.groupby(df['bool'].astype(int).diff().ne(0).cumsum())['bool'].cumsum()

Output:

>>> df

    bool  count
0  False      0
1   True      1
2   True      2
3   True      3
4  False      0
5   True      1
6   True      2
7  False      0
8  False      0

Explanation:

This code creates separate groups for all consecutive true values (1's) coming before a false value (0), then, treating the trues as 1's and the falses as 0's, computes the cumulative sum for each group, then concatenates the results together.

  1. df.groupby -
    1. df['bool'].astype(int) - Takes each value of bool, converts it to an int (true -> 1, false -> 0),
    2. .diff() - For each integer value, computes the difference between it an the previous value (so if the prev val was False and this is True, 1 (1 - 0); if prev was True and this True, 0 (1 - 1); etc.)
    3. .ne(0) - Converts all values that are not equal to 0 to true, and zeros to false (because (0 != 0) == False)
    4. .cumsum() - Calculates cumulative sum for true (1) values. This way, all the trues before any false (0) get their own unique number, which is returned to the groupby() call, thus grouping separately each group of trues before a false
  2. ['bool'].cumsum() - From each group of consecutive true values (1), get the cumulative sum those 1s.

Upvotes: 1

Related Questions