gussilago
gussilago

Reputation: 932

Pandas rolling sum, variating length

I will try and explain the problem I am currently having concerning cumulative sums on DataFrames in Python, and hopefully you'll grasp it!

Given a pandas DataFrame df with a column returns as such:

              returns
Date                                                          
2014-12-10    0.0000
2014-12-11    0.0200
2014-12-12    0.0500
2014-12-15    -0.0200
2014-12-16    0.0000

Applying a cumulative sum on this DataFrame is easy, just using e.g. df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points) say, yielding only the cumulative sum of the last Y days (data points).

Clarification: Given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?

Hope its clear enough,

Thanks, N

Upvotes: 2

Views: 1995

Answers (4)

gussilago
gussilago

Reputation: 932

Thanks to @DSM I managed to come up with a variation of his solution that actually does pretty much what I was looking for:

import numpy as np
import pandas as pd

df.resample("1w"), how={'A': np.sum})

Yields what I want for the example below:

rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})

Outputs:

>> print r
             A
2000-01-01   1
2000-01-02   2
2000-01-03   3
2000-01-04   4
2000-01-05   5
2000-01-06   6
2000-01-07   7
2000-01-08   8
2000-01-09   9
2000-01-10  10
2000-01-11  11
...
2000-01-25  25
2000-01-26  26
2000-01-27  27
2000-01-28  28

>> print r2
              A
2000-01-02    3
2000-01-09   42
2000-01-16   91
2000-01-23  140
2000-01-30  130

Even though it doesn't start "one week in" in this case (resulting in sum of 3 in the very first case), it does always get the correct rolling sum, starting on the previous date with initial value of zero.

Upvotes: 0

DSM
DSM

Reputation: 353009

"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.

If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling_sum, and then select only the original dates:

>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1).loc[df.index]
            returns
Date               
2014-12-10     0.00
2014-12-11     0.02
2014-12-12     0.07
2014-12-15    -0.02
2014-12-16    -0.02

or, step by step:

>>> df.resample("1d")
            returns
Date               
2014-12-10     0.00
2014-12-11     0.02
2014-12-12     0.05
2014-12-13      NaN
2014-12-14      NaN
2014-12-15    -0.02
2014-12-16     0.00
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1)
            returns
Date               
2014-12-10     0.00
2014-12-11     0.02
2014-12-12     0.07
2014-12-13     0.05
2014-12-14      NaN
2014-12-15    -0.02
2014-12-16    -0.02

Upvotes: 1

user2707389
user2707389

Reputation: 827

I am not sure if there is a built in method but it does not seem very difficult to write one. for example, here is one for pandas series.

def cum(df, interval):
    all = []
    quotient = len(df)//interval
    intervals = range(quotient)
    for i in intervals:
        all.append(df[0:(i+1)*interval].sum())
    return pd.Series(all)
>>>s1 = pd.Series(range(20))
>>>print(cum(s1, 4))
0     6
1    28
2    66
3   120
4   190
dtype: int64

Upvotes: 0

ari
ari

Reputation: 4513

The way I would do it is with helper columns. It's a little kludgy but it should work:

numgroups = int(len(df)/(x-1))
df['groupby'] = sorted(list(range(numgroups))*x)[:len(df)]
df['mask'] = (([0]*(x-y)+[1]*(y))*numgroups)[:len(df)]
df['masked'] = df.returns*df['mask']
df.groupby('groupby').masked.cumsum()

Upvotes: 0

Related Questions