decipher

Reputation: 53

Rolling grouped cumulative sum

I'm looking to create a rolling grouped cumulative sum. I can get the result via iteration, but wanted to see if there was a more intelligent way.

Here's what the source data looks like:

Per C   V
1   c   3
1   a   4
1   c   1
2   a   6
2   b   5
3   j   7
4   x   6
4   x   5
4   a   9
5   a   2
6   c   3
6   k   6
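
For reference, here is one way to rebuild that sample frame (my reconstruction; the question only posts the table, so the dtypes are assumed):

import pandas as pd

# Sample data from the table above (reconstructed; integer Per and V assumed)
df = pd.DataFrame({
    'Per': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6],
    'C':   ['c', 'a', 'c', 'a', 'b', 'j', 'x', 'x', 'a', 'a', 'c', 'k'],
    'V':   [3, 4, 1, 6, 5, 7, 6, 5, 9, 2, 3, 6],
})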

Here is the desired result:

Per C   V
1   c   4
1   a   4
2   c   4
2   a   10
2   b   5
3   c   4
3   a   10
3   b   5
3   j   7
4   c   4
4   a   19
4   b   5
4   j   7
4   x   11
5   c   4
5   a   21
5   b   5
5   j   7
5   x   11
6   c   7
6   a   21
6   b   5
6   j   7
6   x   11
6   k   6
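
The iteration the asker mentions is not shown; a minimal sketch of one such loop (my own, using the df built above) would fold each period's values into a running per-category total and emit one row per category seen so far:

def rolling_grouped_cumsum(df):
    # Naive iteration: carry a running total per category across periods.
    totals = {}   # insertion-ordered in Python 3.7+, so output keeps first-seen order
    rows = []
    for per in sorted(df['Per'].unique()):
        chunk = df[df['Per'] == per]
        for c, v in zip(chunk['C'], chunk['V']):
            totals[c] = totals.get(c, 0) + v
        for c, total in totals.items():
            rows.append({'Per': per, 'C': c, 'V': total})
    return pd.DataFrame(rows)

The answers below replace this loop with vectorised pandas operations.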

Upvotes: 5

Views: 635

Answers (3)

unutbu

Reputation: 880777

You could use pivot_table/cumsum:

import numpy as np

(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
   .fillna(0)
   .cumsum(axis=0)
   .replace(0, np.nan)
   .stack().reset_index())

yields

    Per  C     0
0     1  a   4.0
1     1  c   4.0
2     2  a  10.0
3     2  b   5.0
4     2  c   4.0
5     3  a  10.0
6     3  b   5.0
7     3  c   4.0
8     3  j   7.0
9     4  a  19.0
10    4  b   5.0
11    4  c   4.0
12    4  j   7.0
13    4  x  11.0
14    5  a  21.0
15    5  b   5.0
16    5  c   4.0
17    5  j   7.0
18    5  x  11.0
19    6  a  21.0
20    6  b   5.0
21    6  c   7.0
22    6  j   7.0
23    6  k   6.0
24    6  x  11.0

On the plus side, I think the pivot_table/cumsum approach helps convey the meaning of the calculation pretty well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:

In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]: 
C      a    b    c    j    k     x
Per                               
1    4.0  NaN  4.0  NaN  NaN   NaN
2    6.0  5.0  NaN  NaN  NaN   NaN
3    NaN  NaN  NaN  7.0  NaN   NaN
4    9.0  NaN  NaN  NaN  NaN  11.0
5    2.0  NaN  NaN  NaN  NaN   NaN
6    NaN  NaN  3.0  NaN  6.0   NaN

On the negative side, the need to fuss with 0s and NaNs is not ideal. We need 0s for the cumsum, but we need NaNs to make the unwanted rows disappear when the DataFrame is stacked.
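
If the zeros bother you, one variant (my sketch, not part of the original answer) keeps the cumulative sums and masks out only the cells for categories that have not appeared yet; like the original, it relies on stack() dropping NaNs, but a legitimate zero (or negative) cumulative sum would survive:

pt = df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
seen = pt.notna().cummax()   # True from a category's first period onward
(pt.fillna(0)
   .cumsum(axis=0)
   .where(seen)              # NaN only where the category hasn't appeared yet
   .stack().reset_index())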

The pivot_table/cumsum approach also offers a considerable speed advantage over using_concat, but piRSquared's solution is the fastest. On a 1000-row df:

In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop

In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop

In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop

In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop

Here is the setup I used for the benchmark:

import numpy as np
import pandas as pd

def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
             .fillna(0)
             .cumsum(axis=0)
             .replace(0, np.nan)
             .stack().reset_index())

def using_reindex(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    """
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])

    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_reindex2(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P<=i][['C','V']].assign(P=i) 
                       for i in df.P.unique()])
              .groupby(by=['P','C'])
              .sum()
              .reset_index())

def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows,3)), columns=list('PCV'))
    return df

df = make(1000)

Upvotes: 3

piRSquared

Reputation: 294516

If you set the index to be 'Per' and 'C', you can first sum over those index levels. Then I reindex the resulting series by the product of the index levels in order to get all possibilities, filling in the new indices with zero.

After this, I use groupby, cumsum, and remove zeros.

s = df.set_index(['Per', 'C']).V.sum(level=[0, 1])

s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

    Per  C   V
0     1  a   4
1     1  c   4
2     2  a  10
3     2  b   5
4     2  c   4
5     3  a  10
6     3  b   5
7     3  c   4
8     3  j   7
9     4  a  19
10    4  b   5
11    4  c   4
12    4  j   7
13    4  x  11
14    5  a  21
15    5  b   5
16    5  c   4
17    5  j   7
18    5  x  11
19    6  a  21
20    6  b   5
21    6  c   7
22    6  j   7
23    6  k   6
24    6  x  11
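
A version note (my addition, not part of the original answer): Series.sum(level=...) was deprecated in pandas 1.x and removed in 2.0, so on newer versions the first line can be written with an equivalent groupby (this is the using_reindex2 variant benchmarked above); the rest is unchanged:

s = df.groupby(['Per', 'C'])['V'].sum()

s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()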

Upvotes: 3

Allen Qin

Reputation: 19957

This is a very interesting problem. The idea below is, for each period i, to take every row with Per <= i, relabel it as period i, and then group-sum the result; try it and see if it works for you.

(
    pd.concat([df.loc[df.Per<=i][['C','V']].assign(Per=i) for i in df.Per.unique()])
    .groupby(by=['Per','C'])
    .sum()
    .reset_index()
)

Out[197]: 
    Per  C   V
0     1  a   4
1     1  c   4
2     2  a  10
3     2  b   5
4     2  c   4
5     3  a  10
6     3  b   5
7     3  c   4
8     3  j   7
9     4  a  19
10    4  b   5
11    4  c   4
12    4  j   7
13    4  x  11
14    5  a  21
15    5  b   5
16    5  c   4
17    5  j   7
18    5  x  11
19    6  a  21
20    6  b   5
21    6  c   7
22    6  j   7
23    6  k   6
24    6  x  11

Upvotes: 3
