Sam

Reputation: 89

Groupby over periods of time

I have a table containing ids, dates, a target (potentially multiclass, but for now binary, where 1 is a fail) and a yearmonth column derived from the date column. Below are the first 8 rows of this table:

row id       date  target yearmonth
  0  A 2015-03-16       0   2015-03
  1  A 2015-05-29       1   2015-05
  2  A 2015-08-02       1   2015-08
  3  A 2015-09-05       1   2015-09
  4  A 2015-09-22       0   2015-09
  5  A 2015-10-15       1   2015-10
  6  A 2015-11-09       1   2015-11
  7  B 2015-04-17       0   2015-04

I want to create lookback features over, say, the last 3 months, so that for each row we look into the past and see how that id performed over that window. For example, for row 6, where the date is 9 Nov 2015, the percentage of fails for id A over the last 3 calendar months (i.e. the whole of August, September and October) would be 75% (using rows 2-5).

import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'date': ['2015-03-16', '2015-05-29', '2015-08-02', '2015-09-05',
                            '2015-09-22', '2015-10-15', '2015-11-09', '2015-04-17'],
                   'target': [0, 1, 1, 1, 0, 1, 1, 0]})

df['date'] = pd.to_datetime(df['date'])   # ISO dates, so no dayfirst needed
df['yearmonth'] = df['date'].dt.to_period('M')

agg_dict = {
    "Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
    "Fail_Count": pd.NamedAgg(column='target', aggfunc=lambda x: (x == 1).sum()),
    "Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=lambda x: (x == 1).mean() * 100)
}

df.groupby(['id', 'yearmonth']).agg(**agg_dict).reset_index(level=1)

I've aggregated by id and month (code above, result below) and I've tried things like rolling windows, but I couldn't find a way to aggregate looking back over a specific period for each individual row. Any help is appreciated.

id yearmonth  Total_Transactions  Fail_Count  Perc_Monthly_Fails
A    2015-03                   1           0                   0
A    2015-05                   1           1                 100
A    2015-08                   1           1                 100
A    2015-09                   2           1                  50
A    2015-10                   1           1                 100
A    2015-11                   1           1                 100
B    2015-04                   1           0                   0

Upvotes: 2

Views: 107

Answers (2)

Sam

Reputation: 89

I have modified @ALollz's code so that it applies better to my original dataset, where I have a multiclass target and want to obtain the percentage of fails for classes 1 and 2 plus the number of transactions, grouped by different columns over different periods of time. I also decided it is simpler and better to use the last x months prior to each date rather than calendar months. My solution:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'Type': ['T1', 'T3', 'T1', 'T2', 'T2', 'T1', 'T1', 'T3'],
                   'date': ['2015-03-16', '2015-05-29', '2015-08-10', '2015-09-05',
                            '2015-09-22', '2015-11-08', '2015-11-09', '2015-04-17'],
                   'target': [2, 1, 2, 1, 0, 1, 2, 0]})
df['date'] = pd.to_datetime(df['date'])

def get_prev_avg(row, df, columnname, lastxmonths):
    # Keep rows with the same value in `columnname` whose date falls in the
    # window [row date - lastxmonths months, row date - 1 day]
    df = df[df[columnname].eq(row[columnname])
            & df['date'].between(row['date'] - pd.offsets.DateOffset(months=lastxmonths),
                                 row['date'] - pd.offsets.DateOffset(days=1))]
    if not df.empty:
        NrTransactions = len(df)
        PctMinorFails = df['target'].eq(1).mean() * 100
        PctMajorFails = df['target'].eq(2).mean() * 100
        return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
    return pd.Series([np.nan, np.nan, np.nan])
    
for lastxmonths in [3, 4]:
    for columnname in ['Id', 'Type']:
        cols = [f'NrTransactionsBy{columnname}Last{lastxmonths}Months',
                f'PctMinorFailsBy{columnname}Last{lastxmonths}Months',
                f'PctMajorFailsBy{columnname}Last{lastxmonths}Months']
        df[cols] = df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths),
                            axis=1)

Each iteration takes a couple of hours on my original dataset, which is not great, but I'm unsure how to optimise it further; a possible vectorised alternative is sketched below.
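For reference, one iteration of the loop above could potentially be vectorised by adapting @ALollz's self-merge approach (see the other answer) to day-based windows. A rough sketch, not tested at scale; the column names are illustrative for the (columnname='Id', lastxmonths=3) combination, and note that the self-merge trades memory for speed:

d = df.reset_index()
# Pair every row with every row for the same Id, keeping the original index
m = d[['index', 'Id', 'date']].merge(d[['Id', 'date', 'target']],
                                     on='Id', suffixes=['', '_past'])

# Keep only past rows within the 3 months before each row's date
m = m[m['date_past'].between(m['date'] - pd.offsets.DateOffset(months=3),
                             m['date'] - pd.offsets.DateOffset(days=1))]

g = m.groupby('index')['target']
out = pd.DataFrame({'NrTransactionsByIdLast3Months': g.size(),
                    'PctMinorFailsByIdLast3Months': g.apply(lambda s: s.eq(1).mean() * 100),
                    'PctMajorFailsByIdLast3Months': g.apply(lambda s: s.eq(2).mean() * 100)})

df = df.join(out)   # aligns on the original index; rows with no history stay NaN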

Upvotes: 1

ALollz

Reputation: 59519

You can do this by merging the DataFrame with itself on 'id'.

First we'll create a first-of-month 'fom' column, since your logic looks back over prior calendar months rather than from the exact date. Then we merge the DataFrame with itself, bringing along the index so we can assign the result back at the end.

With month offsets we can then filter down to the observations that fall within 3 months of each row's observation. We then group by the original index and take the mean of 'target' to get the percent fail, which we can assign straight back (alignment on index).

If there are NaN in the output, it's because that row had no observations in the prior 3 months, so the percentage can't be calculated.

#df['date'] = pd.to_datetime(df['date'])
# Truncate each date to the first of its month. Credit @anky for the original
# .astype('datetime64[M]') trick; newer pandas versions reject that cast, so
# the to_period round-trip is used here instead.
df['fom'] = df['date'].dt.to_period('M').dt.to_timestamp()

df1 = df.reset_index()
# Self-merge on 'id' so every row is paired with every row for the same id,
# carrying the original index along for the final assignment
df1 = df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past'])

# Keep only observations whose month is 1 to 3 months before the row's month
df1 = df1[df1.fom_past.between(df1.fom - pd.offsets.DateOffset(months=3),
                               df1.fom - pd.offsets.DateOffset(months=1))]

# Mean of the binary target per original row = percent fail (aligns on index)
df['Pct_fail'] = df1.groupby('index').target.mean() * 100

  id       date  target        fom    Pct_fail
0  A 2015-03-16       0 2015-03-01         NaN   # No Rows to Avg
1  A 2015-05-29       1 2015-05-01    0.000000   # Avg Rows 0
2  A 2015-08-02       1 2015-08-01  100.000000   # Avg Rows 1
3  A 2015-09-05       1 2015-09-01  100.000000   # Avg Rows 2
4  A 2015-09-22       0 2015-09-01  100.000000   # Avg Rows 2 
5  A 2015-10-15       1 2015-10-01   66.666667   # Avg Rows 2,3,4
6  A 2015-11-09       1 2015-11-01   75.000000   # Avg Rows 2,3,4,5
7  B 2015-04-17       0 2015-04-01         NaN   # No Rows to Avg

If you're having an issue with memory, we can take a very slow loop approach, which subsets the frame for each row and then calculates the average from that subset.

import numpy as np

def get_prev_avg(row, df):
    # Subset to the same id and to months 1-3 before this row's month
    df = df[df['id'].eq(row['id'])
            & df['fom'].between(row['fom'] - pd.offsets.DateOffset(months=3),
                                row['fom'] - pd.offsets.DateOffset(months=1))]
    if not df.empty:
        return df['target'].mean() * 100
    return np.nan

#df['date'] = pd.to_datetime(df['date'])
df['fom'] = df['date'].dt.to_period('M').dt.to_timestamp()   # first of month, as above

df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
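Running the loop version on the sample frame should reproduce the Pct_fail column from the table above. A minimal check, with the expected values simply transcribed from that output:

expected = pd.Series([np.nan, 0.0, 100.0, 100.0, 100.0, 200/3, 75.0, np.nan])
pd.testing.assert_series_equal(df['Pct_fail'], expected, check_names=False)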

Upvotes: 3
