Sam

Reputation: 89

Groupby over periods of time

I have a table containing ids, dates, a target (potentially multiclass, but for now binary, where 1 is a fail) and a yearmonth column derived from the date column. Below are the first 8 rows of this table:

row id       date  target yearmonth
  0  A 2015-03-16       0   2015-03
  1  A 2015-05-29       1   2015-05
  2  A 2015-08-02       1   2015-08
  3  A 2015-09-05       1   2015-09
  4  A 2015-09-22       0   2015-09
  5  A 2015-10-15       1   2015-10
  6  A 2015-11-09       1   2015-11
  7  B 2015-04-17       0   2015-04

I want to create lookback features over, say, the last 3 months, so that for each row we look into the past and see how that id performed over that window. For example, for row 6, where the date is 9 Nov 2015, the percentage of fails for id A over the last 3 calendar months (i.e. the whole of August, September and October) would be 75% (using rows 2-5).

import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'date': ['2015-03-16', '2015-05-29', '2015-08-02', '2015-09-05',
                            '2015-09-22', '2015-10-15', '2015-11-09', '2015-04-17'],
                   'target': [0, 1, 1, 1, 0, 1, 1, 0]})

df['date'] = pd.to_datetime(df['date'])   # ISO dates, so no dayfirst needed
df['yearmonth'] = df['date'].dt.to_period('M')

agg_dict = {
    "Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
    "Fail_Count": pd.NamedAgg(column='target', aggfunc=lambda x: (x == 1).sum()),
    "Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=lambda x: (x == 1).mean() * 100)
}

df.groupby(['id', 'yearmonth']).agg(**agg_dict).reset_index(level=1)

I've aggregated by id and month (code above, result below) and I've tried things like rolling windows, but I couldn't find a way to aggregate looking back over a specific period for each individual row. Any help is appreciated.

id yearmonth  Total_Transactions  Fail_Count  Perc_Monthly_Fails
A    2015-03                   1           0                   0
A    2015-05                   1           1                 100
A    2015-08                   1           1                 100
A    2015-09                   2           1                  50
A    2015-10                   1           1                 100
A    2015-11                   1           1                 100
B    2015-04                   1           0                   0

Upvotes: 2

Views: 107

Answers (2)

Sam

Reputation: 89

I have modified @ALollz's code so that it applies better to my original dataset, where I have a multiclass target and want to obtain the percentage of fails for classes 1 and 2 plus the number of transactions, grouped by different columns over different periods of time. I also decided it is simpler and better to use the last x months prior to each date rather than calendar months. My solution:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'Type': ['T1', 'T3', 'T1', 'T2', 'T2', 'T1', 'T1', 'T3'],
                   'date': ['2015-03-16', '2015-05-29', '2015-08-10', '2015-09-05',
                            '2015-09-22', '2015-11-08', '2015-11-09', '2015-04-17'],
                   'target': [2, 1, 2, 1, 0, 1, 2, 0]})
df['date'] = pd.to_datetime(df['date'])

def get_prev_avg(row, df, columnname, lastxmonths):
    # Keep rows with the same value in `columnname` whose date falls in the
    # window [row date - lastxmonths months, row date - 1 day]
    df = df[df[columnname].eq(row[columnname])
            & df['date'].between(row['date'] - pd.offsets.DateOffset(months=lastxmonths),
                                 row['date'] - pd.offsets.DateOffset(days=1))]
    if not df.empty:
        NrTransactions = len(df)
        PctMinorFails = df['target'].eq(1).mean() * 100
        PctMajorFails = df['target'].eq(2).mean() * 100
        return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
    return pd.Series([np.nan, np.nan, np.nan])
    
for lastxmonths in [3, 4]:
    for columnname in ['Id', 'Type']:
        cols = [f'NrTransactionsBy{columnname}Last{lastxmonths}Months',
                f'PctMinorFailsBy{columnname}Last{lastxmonths}Months',
                f'PctMajorFailsBy{columnname}Last{lastxmonths}Months']
        df[cols] = df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths),
                            axis=1)

Each iteration takes a couple of hours on my original dataset, which is not great, but I'm unsure how to optimise it further; a possible vectorised alternative is sketched below.
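For reference, one iteration of the loop above could potentially be vectorised by adapting @ALollz's self-merge approach (see the other answer) to day-based windows. A rough sketch, not tested at scale; the column names are illustrative for the (columnname='Id', lastxmonths=3) combination, and note that the self-merge trades memory for speed:

d = df.reset_index()
# Pair every row with every row for the same Id, keeping the original index
m = d[['index', 'Id', 'date']].merge(d[['Id', 'date', 'target']],
                                     on='Id', suffixes=['', '_past'])

# Keep only past rows within the 3 months before each row's date
m = m[m['date_past'].between(m['date'] - pd.offsets.DateOffset(months=3),
                             m['date'] - pd.offsets.DateOffset(days=1))]

g = m.groupby('index')['target']
out = pd.DataFrame({'NrTransactionsByIdLast3Months': g.size(),
                    'PctMinorFailsByIdLast3Months': g.apply(lambda s: s.eq(1).mean() * 100),
                    'PctMajorFailsByIdLast3Months': g.apply(lambda s: s.eq(2).mean() * 100)})

df = df.join(out)   # aligns on the original index; rows with no history stay NaN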

Upvotes: 1

ALollz

Reputation: 59519

You can do this by merging the DataFrame with itself on 'id'.

First we'll create a first-of-month 'fom' column, since your logic looks back over prior calendar months rather than from the exact date. Then we merge the DataFrame with itself, bringing along the index so we can assign the result back at the end.

With month offsets we can then filter down to the observations that fall within 3 months of each row's observation. We then group by the original index and take the mean of 'target' to get the percent fail, which we can assign straight back (alignment on index).

If there are NaN in the output, it's because that row had no observations in the prior 3 months, so the percentage can't be calculated.

#df['date'] = pd.to_datetime(df['date'])
# Truncate each date to the first of its month. Credit @anky for the original
# .astype('datetime64[M]') trick; newer pandas versions reject that cast, so
# the to_period round-trip is used here instead.
df['fom'] = df['date'].dt.to_period('M').dt.to_timestamp()

df1 = df.reset_index()
# Self-merge on 'id' so every row is paired with every row for the same id,
# carrying the original index along for the final assignment
df1 = df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past'])

# Keep only observations whose month is 1 to 3 months before the row's month
df1 = df1[df1.fom_past.between(df1.fom - pd.offsets.DateOffset(months=3),
                               df1.fom - pd.offsets.DateOffset(months=1))]

# Mean of the binary target per original row = percent fail (aligns on index)
df['Pct_fail'] = df1.groupby('index').target.mean() * 100

  id       date  target        fom    Pct_fail
0  A 2015-03-16       0 2015-03-01         NaN   # No Rows to Avg
1  A 2015-05-29       1 2015-05-01    0.000000   # Avg Rows 0
2  A 2015-08-02       1 2015-08-01  100.000000   # Avg Rows 1
3  A 2015-09-05       1 2015-09-01  100.000000   # Avg Rows 2
4  A 2015-09-22       0 2015-09-01  100.000000   # Avg Rows 2 
5  A 2015-10-15       1 2015-10-01   66.666667   # Avg Rows 2,3,4
6  A 2015-11-09       1 2015-11-01   75.000000   # Avg Rows 2,3,4,5
7  B 2015-04-17       0 2015-04-01         NaN   # No Rows to Avg

If you're having an issue with memory, we can take a very slow loop approach, which subsets the frame for each row and then calculates the average from that subset.

import numpy as np

def get_prev_avg(row, df):
    # Subset to the same id and to months 1-3 before this row's month
    df = df[df['id'].eq(row['id'])
            & df['fom'].between(row['fom'] - pd.offsets.DateOffset(months=3),
                                row['fom'] - pd.offsets.DateOffset(months=1))]
    if not df.empty:
        return df['target'].mean() * 100
    return np.nan

#df['date'] = pd.to_datetime(df['date'])
df['fom'] = df['date'].dt.to_period('M').dt.to_timestamp()   # first of month, as above

df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
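Running the loop version on the sample frame should reproduce the Pct_fail column from the table above. A minimal check, with the expected values simply transcribed from that output:

expected = pd.Series([np.nan, 0.0, 100.0, 100.0, 100.0, 200/3, 75.0, np.nan])
pd.testing.assert_series_equal(df['Pct_fail'], expected, check_names=False)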

Upvotes: 3
