Josh

Reputation: 4806

Cumulatively applying an operation to successive rows in a pandas DataFrame

I have a pandas DataFrame that looks like the following:

import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

             A   B
2017-08-01   2   3
2017-08-02   4   5
2017-08-03   6   7
2017-08-04   8   9

I'd like to cumulatively multiply the values down the columns. Using column A as an example, the second row becomes 2*4, the third row becomes 2*4*6, and the last row becomes 2*4*6*8. Same for column B. So, the desired result is:

             A    B
2017-08-01   2    3
2017-08-02   8    15
2017-08-03   48   105
2017-08-04   384  945

There must be a built-in way to do this, but I can't even get it working with for loops because of chained-assignment problems.

Upvotes: 2

Views: 523

Answers (3)

cs95

Reputation: 402813

Use DataFrame.cumprod:

out = sample.cumprod()
print(out)
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
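
As a small sketch of how the method behaves beyond the default call: cumprod returns a new frame and leaves sample untouched, and it accepts axis and skipna arguments. The names down, across, with_gap and skipped below are only illustrative, and the NaN is inserted purely to show the skipna behaviour; it is not part of the original data.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# axis=0 (the default) accumulates down each column
down = sample.cumprod()

# axis=1 accumulates across each row instead
across = sample.cumprod(axis=1)

# skipna=True (the default) keeps NaN where it occurs and carries on past it
with_gap = sample.astype(float)
with_gap.iloc[1, 0] = np.nan  # illustrative missing value in column A
skipped = with_gap.cumprod()

print(down)
print(across)
print(skipped)
```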

You can also use np.cumprod on the underlying values. Note that this overwrites sample in place:

import numpy as np
sample[:] = np.cumprod(sample.values, axis=0)
print(sample)
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
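
Because the assignment to sample[:] above replaces the original values, running it twice compounds the products. If the original frame is still needed, one option (a sketch, with result as an illustrative name) is to build a new DataFrame from the NumPy output instead:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# Compute on the raw array, then wrap it with the original labels;
# sample itself is not modified.
result = pd.DataFrame(np.cumprod(sample.to_numpy(), axis=0),
                      index=sample.index,
                      columns=sample.columns)
print(result)
```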

Finally, using itertools.accumulate (just for fun):

from itertools import accumulate
from operator import mul

pd.DataFrame(
    np.column_stack([list(accumulate(sample[c], mul)) for c in sample.columns]),
    index=sample.index,
    columns=sample.columns)

              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945

Upvotes: 5

jezrael

Reputation: 863166

Use DataFrame.cumprod:

print(sample.cumprod())
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945

Alternatively, use numpy.cumprod:

print(np.cumprod(sample))
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945

Timings:

np.random.seed(334)
N = 2000
df = pd.DataFrame({'A': np.random.choice([1, 2], N, p=(0.99, 0.01)),
                   'B': np.random.choice([1, 2], N, p=(0.99, 0.01))})
print(df)

In [31]: %timeit (df.cumprod())
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 150 µs per loop

In [32]: %timeit (np.cumprod(df))
10000 loops, best of 3: 165 µs per loop

In [33]: %timeit (df.apply(np.cumprod))
The slowest run took 5.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.23 ms per loop
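
The %timeit lines above need IPython. Outside it, a rough equivalent can be sketched with the standard-library timeit module. The candidates and timings dictionaries and the repeat counts are illustrative choices, and the numbers will vary by machine and pandas version, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(334)
N = 2000
df = pd.DataFrame({'A': np.random.choice([1, 2], N, p=(0.99, 0.01)),
                   'B': np.random.choice([1, 2], N, p=(0.99, 0.01))})

# The three approaches compared in the timings above
candidates = {
    'df.cumprod()': lambda: df.cumprod(),
    'np.cumprod(df)': lambda: np.cumprod(df),
    'df.apply(np.cumprod)': lambda: df.apply(np.cumprod),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 runs of 100 calls each, reported per call
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    timings[label] = best
    print(f'{label}: {best * 1e6:.1f} µs per loop')
```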

Upvotes: 4

Max

Reputation: 1363

A DataFrame has a method named cumprod. You can use it as follows:

sample.cumprod()
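
For comparison with the for-loop attempt mentioned in the question, here is a sketch of an explicit loop that avoids chained assignment by selecting and assigning in a single .iloc call on a copy. It is shown only to illustrate the pattern (result is an illustrative name); cumprod is the simpler and faster choice.

```python
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

result = sample.copy()
for i in range(1, len(result)):
    # One .iloc call on the left-hand side, so the write goes to result
    # itself rather than to a temporary produced by chained indexing.
    result.iloc[i] = result.iloc[i - 1] * result.iloc[i]

print(result)
```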

Upvotes: 1
