abatra
abatra

Reputation: 11

Fill Nans with rolling mean which is a combination of actual data + calculated rolling mean

I have tried to find the solution to this but havent been able to. So asking. I have a dataset for which I wish to forecast values for future dates for groups within the dataset. The goal would be to fill NANs with rolling mean average of the last 5 days with actuals where actual available and rolling mean where actual is not available.

Sample Data

enter image description here

 VALUE  EXPECTED
   5.0         5
  10.0        10
  15.0        15
  20.0        20
  25.0        25
   NaN        15
  50.0        50
   NaN        25
   NaN        27

Here is the code that I used to try to get the expected value but end with something weird:

df_grouped_index['RETENTION_FCST_IMPUTED'] = (
    df_grouped_index
    .sort_values(['INSTALLMENT_KEY', 'PLATFORM_SDESC', 'RELATIVE_DAY_KEY', 'DAY_KEY'])
    .groupby(['INSTALLMENT_KEY', 'PLATFORM_SDESC', 'RELATIVE_DAY_KEY'], group_keys=False)
    .apply(lambda x: (
        x['RETENTION_CALCULATED']
        #.fillna(method='ffill')  # Forward fill within each group to avoid NaNs in the rolling mean
        .rolling(6, min_periods=1, win_type = None , method= 'single') 
        .mean()
        .where(x['RETENTION_CALCULATED'].isnull())  # Only apply to original NaNs
        .combine_first(x['RETENTION_CALCULATED'])
    ))
)

enter image description here

Upvotes: 1

Views: 39

Answers (1)

mozway
mozway

Reputation: 262194

If you want to use the last 5 values (ignoring NaNs), combine rolling.mean, shift and fillna:

df['out'] = df['VALUE'].fillna(df['VALUE'].rolling(5, min_periods=1).mean().shift())

This gives however slightly different values that in your example.

Output:

   VALUE  EXPECTED        out
0    5.0         5   5.000000
1   10.0        10  10.000000
2   15.0        15  15.000000
3   20.0        20  20.000000
4   25.0        25  25.000000
5    NaN        15  15.000000  # (5+10+15+20+25)/5
6   50.0        50  50.000000
7    NaN        25  27.500000  # (15+20+25+50)/4
8    NaN        27  31.666667  # (20+25+50)/3

If you want to include the previously filled values in the computation, you can't vectorize and would need to use numba:

from numba import jit

@jit(nopython=True)
def fill_avg(a, window=5):
    out = a.copy()
    for i, x in enumerate(a):
        if np.isnan(x):
            out[i] = np.mean(out[max(0, i-window):i])
    return out

df['out'] = fill_avg(df['VALUE'].to_numpy())

Output:

   VALUE  EXPECTED   out
0    5.0         5   5.0
1   10.0        10  10.0
2   15.0        15  15.0
3   20.0        20  20.0
4   25.0        25  25.0
5    NaN        15  15.0  # (5+10+15+20+25)/5
6   50.0        50  50.0
7    NaN        25  25.0  # (15+20+25+15+50)/5
8    NaN        27  27.0  # (20+25+15+50+25)/5

Upvotes: 1

Related Questions