Galaffer
Galaffer

Reputation: 171

Average of up to X values, in turn averaged with the last value?

I have time series data from many individuals, where the last entry is rather important for the next entry - but not important enough to stand alone as a predictor for the next entry.

Therefore, I'd like to do this for each entry:

(1) Calculate the average of up to 5 previous entries within the same individual (may be fewer entries, as in my example df)

(2) Grab the last entry of the same individual

(3) Calculate an average of (1) and (2) (equally weighted) as my predicted value for the next entry

My data looks like so:

df = pd.DataFrame({'date':[
    '01.01.2020','02.01.2020','03.01.2020','10.01.2020', 
    '01.01.2020','04.02.2020','20.02.2020','21.02.2020', 
    '01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
    'user':[1,1,1,1,2,2,2,2,3,3,3,3],
    'days_until_next_event':[1,1,7,NaN,3,16,1,NaN,9,10,29,NaN]})

The column I'd like to predict is 'days_until_next_event', so for instance for user 3, I'd have the following data:1,9,10,29. The calculation would be: (((1+9+10+29)/4) + 29)/2, with the result of 20.625.

I have a hard time structuring the solution well (beginner!).. Here's where I am at right now:

def weightavg(df, group, num_avg):
  average = df.groupby('user').mean()
  latest = ? # Here I'd need to grab the value from the prior row?
  prediction = (average + latest) / 2 

Expected output for user 1:

date         user        days_until_next_event     predicted
01.01.2020   1           1                         NaN
02.01.2020   1           1                         1
03.01.2020   1           7                         1
10.01.2020   1           NaN                       5

The 5 in the last row would come from the average of all prior values (1+1+7)/3 = 3, and from the last value 7, averaged. So: (((1+1+7)/3) + 7)/2

Upvotes: 1

Views: 73

Answers (1)

jezrael
jezrael

Reputation: 862791

I believe you need Series.rolling per groups and add mean with last value, last divide by 2:

g = (df.assign(new = df['days_until_next_event'])
       .set_index(['date','days_until_next_event'])
       .groupby('user')['new']
       .rolling(5, min_periods=1))

df1 = (g.mean().add(g.apply(lambda x: x.iat[-1])).div(2)
        .groupby(level=0)
        .shift()
        .reset_index(name='predicted'))
    
print (df1)
    user        date  days_until_next_event  predicted
0      1  01.01.2020                    1.0        NaN
1      1  02.01.2020                    1.0   1.000000
2      1  03.01.2020                    7.0   1.000000
3      1  10.01.2020                    NaN   5.000000
4      2  01.01.2020                    3.0        NaN
5      2  04.02.2020                   16.0   3.000000
6      2  20.02.2020                    1.0  12.750000
7      2  21.02.2020                    NaN   3.833333
8      3  01.02.2020                    9.0        NaN
9      3  10.02.2020                   10.0   9.000000
10     3  20.02.2020                   29.0   9.750000
11     3  20.03.2020                    NaN  22.500000

Upvotes: 1

Related Questions