Pandas monthly rolling window

I am looking to do a 'monthly' rolling window on daily data grouped by a category. The code below does not work as is, it leads to the following error:

ValueError: <DateOffset: months=1> is a non-fixed frequency

I know that I could use '30D' offset, however this would shift the date over time.

I'm looking for the sum of a window that spans from the x-th day of a month to that same x-th day of the J-th month. E.g. with J=1: 4th of July to 4th of August, 5th of July to 5th of August, 6th of July to 6th of August etc

I've been trying to figure this out for a few days now. Any suggestions or tipps would be very appreciated. Happy New Year.

MRE:

import pandas as pd
from io import StringIO

data = StringIO(
"""\
date          logret       category
2014-03-25    -0.01           A
2014-04-05    -0.02           A
2014-04-15    -0.03           A
2014-04-25    0.01            B
2014-05-05    0.03            B
2014-05-15    -0.01           A
2014-05-25    0.04            B
"""
)

df = pd.read_csv(data,sep="\s+",parse_dates=True,index_col="date")

J=1

df.groupby(['category'])['logret'].rolling(pd.DateOffset(months=J),min_periods=J*20).sum()

Upvotes: 1

Answers (3)

greenLeopard

Reputation: 73

I too needed to sum over a window spanning from the x-th day of a month to that same x-th day of the next month. It took ages to work out a solution, so I'm sharing it here. (For reference, I'm using Pandas.__version__ = 2.2.2)

My solution uses Pandas.cut() and Pandas.groupby(), I also explored resample(), rolling() and VariableOffsetWindowIndexer() but couldn't make them work for this because the length of the month-to-month windows are not the same.

The steps are (1) create a dataframe, (2) create the cuts for the time periods, then (3) use groupby to get the aggregated results.

import pandas as pd
import random

# create a datetime index
dti = pd.date_range(start='2024-03-04', end='2024-09-13', freq = 'D')
# and some random data
data = [random.random() for i in range(len(dti))]
df = pd.DataFrame(data, index=dti, columns=["values"])
df.head()

Now, get the values required for the cuts:

the reporting period ends of the 22nd of the month
offset the start and end dates by an extra month to catch the boundaries

anniversary_day = 22
start_date = df.index[0] - pd.offsets.MonthBegin(2)
end_date = df.index[-1] + pd.offsets.MonthBegin(1)

Now, create a new index with only the boundaries of the reporting periods:

pd.offsets.Day(x-1) will advance to the day before the anniversay day
pd.offsets.BDay() will increment to the following business day if the anniversary day is a weekend

dti2 = pd.date_range(start=start_date, end=end_date, freq='MS') + pd.offsets.Day(anniversary_day-1) + pd.offsets.BDay()
dti2

Next, use the date-time index dti2 as the bins argument to pd.cut.

And the column from the cut will by the input for the groupby operation. Also, minding whether or not to keep groups that have no values:

observed=False will include groups that have no values
observed=True will exclude groups that have no values

grouping_column = pd.cut(df.index, bins=dti2)
df.groupby(by=grouping_column, observed=True).sum()

I hope this helps someone out there.

Upvotes: 3

Markus Rother

Reputation: 434

In an intermediary step 'normalize' your timestamps, such that every month has 31 days, then aggregate, and finally drop the 'inserted' rows from your result.

That works as long as your aggregation has a neutral element.

create an index from the original df with all timestamps as strings
create another index with strings representing timestamps where all months have 31 days
merge, aggregate, etc.
select from aggregation by the index derived from original df
add new column to original df

import pandas as pd
from io import StringIO

data = StringIO(
"""\
date          logret       category
2014-03-25    -0.01           A
2014-04-05    -0.02           A
2014-04-15    -0.03           A
2014-04-25    0.01            B
2014-05-05    0.03            B
2014-05-15    -0.01           A
2014-05-25    0.04            B
"""
)

df = pd.read_csv(data,sep="\s+",parse_dates=True,index_col="date")
idx = df.index.strftime('%Y-%m-%d')

y0 = df.index[0].year
y1 = df.index[-1].year

padded = pd.DataFrame(index=[f'{y}-{m:02}-{d:02}' 
                             for y in range(y0,y1+1) 
                             for m in range(1, 13)
                             for d in range(1, 32)])[idx[0]:idx[-1]]

# Note that the rolling interval is exclusive at start
df.assign(rolling_aggregate=padded.join(df.set_index(idx)).fillna(0).rolling(31).agg(sum).loc[idx])

yields:

            logret category  rolling_aggregate
date                                          
2014-03-25   -0.01        A                NaN
2014-04-05   -0.02        A                NaN
2014-04-15   -0.03        A                NaN
2014-04-25    0.01        B              -0.04
2014-05-05    0.03        B               0.01
2014-05-15   -0.01        A               0.03
2014-05-25    0.04        B               0.06

Upvotes: 1

Ben Pap

Reputation: 2579

def MonthRollAvg(df, offset, avgColumn):
    #must have DateTimeIndex
    df2 = df.copy()
    df2.index = df2.index + pd.DateOffset(days = -offset)
    return df2.groupby([df2.index.year, df2.index.month])[avgColumn].mean()

MonthRollAvg(df, 5, 'logret')

The easiest way that I could think of is to use pd.DateOffset to move the dates then find the mean. So if you want it from the 6th to the 6th, you would use an offset of 5 to essentially make the 6th of the month the 1st of the month, then find the mean of each month. This will keep the months lengths as they are. You will just have to keep track of what day you are averaging between.

Upvotes: 1

Pandas monthly rolling window

Answers (3)

Related Questions