Pandas Rolling mean with GroupBy and Sort

I have a DataFrame that looks like:

f_period f_year f_month subject month year value
20140102 2014   1      a        1     2018 10
20140109 2014   1      a        1     2018 12
20140116 2014   1      a        1     2018 8
20140202 2014   2      a        1     2018 20
20140209 2014   2      a        1     2018 15
20140102 2014   1      b        1     2018 10
20140109 2014   1      b        1     2018 12
20140116 2014   1      b        1     2018 8
20140202 2014   2      b        1     2018 20
20140209 2014   2      b        1     2018 15

The f_period is the date when a forecast for a SKU (column subject) was made. The month and year column is the period for which the forecast was made. For example, the first row says that on 01/02/2018, the model was forecasting to set 10 units of product a in month 1 of year2018.

I am trying to create a rolling average prediction by subject, by month for 2 f_months. The DataFrame should look like:

f_period f_year f_month subject month year value mnthly_avg rolling_2_avg
20140102 2014   1      a        1     2018 10    10         13
20140109 2014   1      a        1     2018 12    10         13
20140116 2014   1      a        1     2018 8     10         13
20140202 2014   2      a        1     2018 20    17.5       null
20140209 2014   2      a        1     2018 15    17.5       null
20140102 2014   1      b        1     2018 10    10         13
20140109 2014   1      b        1     2018 12    10         13
20140116 2014   1      b        1     2018 8     10         13
20140202 2014   2      b        1     2018 20    17.5       null
20140209 2014   2      b        1     2018 15    17.5       null

Things I tried:

I was able to get mnthly_avg by :

data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
        value.transform('mean')

I tried getting the rolling_2_avg :

rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
            groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()

rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
            value.rolling(2).mean().reset_index(drop=True)

This gave me an unexpected output. I don't understand how it calculated the values for rolling_2_avg

How do I group by subject and month and then sort by f_month and then take the average of the next two-month average?

Upvotes: 1

Views: 5436

Answers (2)

m_h
m_h

Reputation: 585

I would be a bit careful with Josh's solution. If you want to group by the subject you can't use the rolling function like that as it will roll across subjects (i.e. it will eventually take the mean of a month from subject A and B, rather than giving a null which you might prefer).

An alternative can be to split the dataframe and run the rolling individually (I noticed that you want the nulls by the end of the dataframe, whereas you might wanna sort the dataframe before and after):

for unique_subject in df['subject'].unique():
    df_subject = df[df['subject'] == unique_subject]
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject) # just to print, you may wanna concatenate these

Upvotes: 1

Josh Friedlander
Josh Friedlander

Reputation: 11657

Unless I'm misunderstanding it seems simpler than what you've done. What about this?

grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp

Output:

                          value rolling
subject     month   f_month         
a              1    1       30  NaN
               2            35  32.5
b              1    1       30  32.5
               2            35  32.5

Upvotes: 2

Related Questions