Reputation: 1910
I'm having difficulty to solve a look-back or roll-over problem in dataframe or perhaps in groupby.
The following is a simple example of the dataframe I have:
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
…
20141231 apple 3
20141231 grape 2
I need to calculate the average value of 'amount' of each fruit in the previous 3 days for everyday, and create the following data frame:
fruit average_in_last 3 days
20140104 apple 4
20140104 orange 10
...
For example on 20140104, the previous 3 days are 20140101, 20140102 and 20140103 (note the date in the data frame is not continuous and 20140103 does not exist), the average amount of apple is (3+5)/2 = 4 and orange is 10/1=10, the rest is 0.
The sample data frame is very simple but the actual data frame is much more complicated and larger. Hope someone can shed some light on this, thank you in advance!
Upvotes: 10
Views: 14236
Reputation: 765
df1.index=pd.to_datetime(df1.index,format='%Y%m%d')
def function1(ss:pd.Series):
return ss.loc[ss.index<ss.index.max()].mean()
df1.reset_index().assign(av=df1.reset_index().groupby('fruit')
.apply(lambda dd:dd.rolling('4d',on='index').amount.apply(function1))
.droplevel(0)).set_index('index')
fruit amount av
index
2014-01-01 apple 3 NaN
2014-01-02 apple 5 3.0
2014-01-02 orange 10 NaN
2014-01-04 banana 2 NaN
2014-01-04 apple 10 4.0
2014-01-04 orange 4 10.0
2014-01-05 orange 6 7.0
2014-01-05 grape 1 NaN
2014-12-31 apple 3 NaN
2014-12-31 grape 2 NaN
Upvotes: 0
Reputation: 151
I also wanted to use rolling with groupby, this is why I landed on this page, but I believe that I have a workaround that is better than the previous suggestions.
You could do the following:
pivoted_df = pd.pivot_table(df, index='date', columns='fruits', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack().reset_index()
the .stack()
is not necessary, but will transform your pivot table back to a regular df
Upvotes: 8
Reputation: 920
Assuming we have a data frame like that in the beginning,
>>> df
fruit amount
2017-06-01 apple 1
2017-06-03 apple 16
2017-06-04 apple 12
2017-06-05 apple 8
2017-06-06 apple 14
2017-06-08 apple 1
2017-06-09 apple 4
2017-06-02 orange 13
2017-06-03 orange 9
2017-06-04 orange 9
2017-06-05 orange 2
2017-06-06 orange 11
2017-06-07 orange 6
2017-06-08 orange 3
2017-06-09 orange 3
2017-06-10 orange 13
2017-06-02 grape 14
2017-06-03 grape 16
2017-06-07 grape 4
2017-06-09 grape 15
2017-06-10 grape 5
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]
>>> temp = (df.groupby('fruit')['amount']
.apply(lambda x: x.reindex(dates) # fill in the missing dates for each group)
.fillna(0) # fill each missing group with 0
.rolling(3)
.sum()) # do a rolling sum
.reset_index()
.rename(columns={'amount': 'sum_of_3_days',
'level_1': 'date'})) # rename date index to date col
>>> temp.head()
fruit date amount
0 apple 2017-06-01 NaN
1 apple 2017-06-02 NaN
2 apple 2017-06-03 17.0
3 apple 2017-06-04 28.0
4 apple 2017-06-05 36.0
# converts the date index into date column
>>> df = df.reset_index().rename(columns={'index': 'date'})
>>> df.merge(temp, on=['fruit', 'date'])
>>> df
date fruit amount sum_of_3_days
0 2017-06-01 apple 1 NaN
1 2017-06-03 apple 16 17.0
2 2017-06-04 apple 12 28.0
3 2017-06-05 apple 8 36.0
4 2017-06-06 apple 14 34.0
5 2017-06-08 apple 1 15.0
6 2017-06-09 apple 4 5.0
7 2017-06-02 orange 13 NaN
8 2017-06-03 orange 9 22.0
9 2017-06-04 orange 9 31.0
10 2017-06-05 orange 2 20.0
11 2017-06-06 orange 11 22.0
12 2017-06-07 orange 6 19.0
13 2017-06-08 orange 3 20.0
14 2017-06-09 orange 3 12.0
15 2017-06-10 orange 13 19.0
16 2017-06-02 grape 14 NaN
17 2017-06-03 grape 16 30.0
18 2017-06-07 grape 4 4.0
19 2017-06-09 grape 15 19.0
20 2017-06-10 grape 5 20.0
Upvotes: 8
Reputation: 117337
you can do it like this:
>>> df
>>>
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
>>> g= df.set_index('fruit', append=True).groupby(level=1)
>>> res = g['amount'].apply(pd.rolling_mean, 3, 1).reset_index('fruit')
>>> res
fruit 0
20140101 apple 3.000000
20140102 apple 4.000000
20140102 orange 10.000000
20140104 banana 2.000000
20140104 apple 6.000000
20140104 orange 7.000000
20140105 orange 6.666667
20140105 grape 1.000000
update
Well, as @cphlewis mentioned in comments, my code will not give the results you want. I've checked different approaches and the one I found so far is something like this (not sure about performance, though):
>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)
>>> def avg_3_days(x):
return df[(df['index'] >= x['index'] - pd.DateOffset(3)) & (df['index'] < x['index']) & (df['fruit'] == x['fruit'])].amount.mean()
>>> df['res'] = df.apply(avg_3_days, axis=1)
>>> df
index fruit amount res
0 2014-01-01 apple 3 NaN
1 2014-01-02 apple 5 3
2 2014-01-02 orange 10 NaN
3 2014-01-04 banana 2 NaN
4 2014-01-04 apple 10 4
5 2014-01-04 orange 4 10
6 2014-01-05 orange 6 7
7 2014-01-05 grape 1 NaN
Upvotes: 3