Reputation: 4806
I have a pandas DataFrame that looks like the following:
sample = pd.DataFrame([[2,3],[4,5],[6,7],[8,9]],
                      index=pd.date_range('2017-08-01','2017-08-04'),
                      columns=['A','B'])
            A  B
2017-08-01  2  3
2017-08-02  4  5
2017-08-03  6  7
2017-08-04  8  9
I'd like to cumulatively multiply the values down the columns. Using column A as an example, the second row becomes 2*4, the third row becomes 2*4*6, and the last row becomes 2*4*6*8. Same for column B. So, the desired result is:
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
There must be some built-in way to do this, but I'm having trouble even doing it with for loops because of chained assignment issues.
Upvotes: 2
Views: 523
Reputation: 402813
out = sample.cumprod()
print(out)
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
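As a quick check, here is a minimal self-contained sketch that rebuilds the sample frame from the question and compares the result of cumprod against the desired output. The names `expected`, `matches` and `across` are introduced here for illustration only, and the `axis=1` call is shown just to contrast with the default column-wise behaviour.

```python
import pandas as pd

# Rebuild the frame from the question.
sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# Desired result from the question (hypothetical name, used for comparison only).
expected = pd.DataFrame([[2, 3], [8, 15], [48, 105], [384, 945]],
                        index=sample.index, columns=sample.columns)

out = sample.cumprod()           # default axis=0: multiply down each column
matches = out.equals(expected)   # compares values, labels and dtypes

across = sample.cumprod(axis=1)  # axis=1: multiply across each row instead
print(out)
print(matches)
```

Note that cumprod returns a new DataFrame, so `sample` itself is left unchanged.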
You can also use np.cumprod on the values:
sample[:] = np.cumprod(sample.values, axis=0)
print(sample)
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
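Keep in mind that assigning to `sample[:]` overwrites the original frame. If you would rather keep `sample` intact, one possible variant (a sketch, not the only way to do it) is to build a new DataFrame from the NumPy result; the name `result` is just illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

# to_numpy() gives the values as a 2-D array; axis=0 accumulates down the columns.
# np.cumprod returns a new array, so `sample` is not modified.
result = pd.DataFrame(np.cumprod(sample.to_numpy(), axis=0),
                      index=sample.index,
                      columns=sample.columns)
print(result)
```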
Finally, using itertools.accumulate (just for fun):
from itertools import accumulate
from operator import mul

pd.DataFrame(np.column_stack([
                 list(accumulate(sample[c], mul)) for c in sample.columns]),
             index=sample.index,
             columns=sample.columns)
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
Upvotes: 5
Reputation: 863166
Use DataFrame.cumprod:
print (sample.cumprod())
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
Alternatively, use numpy.cumprod:
print (np.cumprod(sample))
              A    B
2017-08-01    2    3
2017-08-02    8   15
2017-08-03   48  105
2017-08-04  384  945
Timings:
np.random.seed(334)
N = 2000
df = pd.DataFrame({'A': np.random.choice([1,2], N, p=(0.99, 0.01)),
                   'B': np.random.choice([1,2], N, p=(0.99, 0.01))})
print (df)
In [31]: %timeit (df.cumprod())
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 150 µs per loop
In [32]: %timeit (np.cumprod(df))
10000 loops, best of 3: 165 µs per loop
In [33]: %timeit (df.apply(np.cumprod))
The slowest run took 5.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.23 ms per loop
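If you want to rerun the comparison outside IPython, here is a rough sketch using the standard timeit module. The `candidates` dictionary, the repeat counts and the variable names are my own choices for illustration; absolute numbers will vary with your machine and library versions, so treat the figures above as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(334)
N = 2000
df = pd.DataFrame({'A': np.random.choice([1, 2], N, p=(0.99, 0.01)),
                   'B': np.random.choice([1, 2], N, p=(0.99, 0.01))})

# The three approaches timed above, wrapped as zero-argument callables.
candidates = {
    'df.cumprod()': lambda: df.cumprod(),
    'np.cumprod(df)': lambda: np.cumprod(df),
    'df.apply(np.cumprod)': lambda: df.apply(np.cumprod),
}

# Check that all three agree before comparing their speed.
reference = df.cumprod()
all_equal = all(func().equals(reference) for func in candidates.values())

# Best of 3 repeats of 100 calls each, converted to microseconds per call.
timings_us = {name: min(timeit.repeat(func, number=100, repeat=3)) / 100 * 1e6
              for name, func in candidates.items()}

for name, micros in timings_us.items():
    print(f'{name}: {micros:.1f} us per loop')
```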
Upvotes: 4
Reputation: 1363
A DataFrame has a method named cumprod. You can use it as follows:
sample.cumprod()
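For comparison with the for-loop attempt mentioned in the question, here is a minimal sketch of an explicit loop that avoids chained assignment by reading and writing each row through a single `.iloc` call on a copy. The name `result` is illustrative; this is only meant to show what cumprod does under the hood conceptually, and the built-in method is the better choice in practice.

```python
import pandas as pd

sample = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9]],
                      index=pd.date_range('2017-08-01', '2017-08-04'),
                      columns=['A', 'B'])

result = sample.copy()  # work on a copy so `sample` stays untouched
for i in range(1, len(result)):
    # One .iloc indexer per read and per write, so there is no chained
    # indexing like result['A'][i] = ... that could trigger the warning.
    result.iloc[i] = result.iloc[i] * result.iloc[i - 1]

same_values = (result.to_numpy() == sample.cumprod().to_numpy()).all()
print(result)
```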
Upvotes: 1