Thanasis

Reputation: 725

Forward filling NA with average value in Pandas Dataframes

I have the following dataframe and I want to fill the forward NaN cells with the last available value divided by the number of NaNs plus one:

[image of the input dataframe]

Update: I would like to iterate over the last x columns and apply this manipulation to each; columns.values is a tuple (i.e. it has multiple rows).

It should be similar to DataFrame.fillna(method='ffill'), but not filling with the last available value itself.
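A plain forward fill only repeats the last value, which is close to but not quite what's wanted; a minimal illustration of the difference (hypothetical data matching the pattern in the answers below):

```python
import numpy as np
import pandas as pd

s = pd.Series([8, np.nan, np.nan, np.nan, np.nan])

print(s.ffill().tolist())  # [8.0, 8.0, 8.0, 8.0, 8.0] -- plain forward fill
# Desired instead: 8 divided by the run length of 5 -> [1.6, 1.6, 1.6, 1.6, 1.6]
```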

The final dataframe should look like that:

[image of the expected output dataframe]

Thanks for your help in advance.

Upvotes: 2

Views: 995

Answers (2)

jpp

Reputation: 164633

You can use GroupBy + transform with 'first' and 'size'. This is similar to @sacul's solution, but avoids the lambda which limits the benefits of vectorisation:

g = df.groupby(df['B'].notnull().cumsum())['B']
df['B'] = g.transform('first') / g.transform('size')
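As a quick check, the two lines above can be run on a small frame matching the question's pattern (the data here is a hypothetical reconstruction based on the examples in this thread):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two runs, each a value followed by four NaNs
df = pd.DataFrame({'B': [8, np.nan, np.nan, np.nan, np.nan,
                         7, np.nan, np.nan, np.nan, np.nan]})

# Group each non-null value together with the NaNs that follow it,
# then spread that value evenly over its group
g = df.groupby(df['B'].notnull().cumsum())['B']
df['B'] = g.transform('first') / g.transform('size')

print(df['B'].tolist())  # [1.6, 1.6, 1.6, 1.6, 1.6, 1.4, 1.4, 1.4, 1.4, 1.4]
```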

Performance benchmarking

There's a 1,500x performance differential for a large dataframe:

# Python 3.6.0, Pandas 0.19.2
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10),
                   'B': [8, np.nan, np.nan, np.nan, np.nan,
                         7, np.nan, np.nan, np.nan, np.nan]})

def jpp(df):
    g = df.groupby(df['B'].notnull().cumsum())['B']
    df['B'] = g.transform('first') / g.transform('size')
    return df

def sac(df):
    df['B'] = df.groupby(df.B.notnull().cumsum(),as_index=False)['B']\
                .transform(lambda x: x.iloc[0]/x.size)
    return df

df = pd.concat([df]*1000)

# Both functions mutate df in place, so time them on copies
assert jpp(df.copy()).equals(sac(df.copy()))

%timeit jpp(df.copy())  # 5.07 ms per loop
%timeit sac(df.copy())  # 7.84 s per loop

Upvotes: 3

sacuL

Reputation: 51335

You could use a groupby instead of ffill: use the cumulative sum of the boolean resulting from notnull as your grouping variable, then transform B by taking the first value and dividing it by the size of each group:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.random(10),
                   'B': [8, np.nan, np.nan, np.nan, np.nan,
                         7, np.nan, np.nan, np.nan, np.nan]})
>>> df
          A    B
0  0.899200  8.0
1  0.011443  NaN
2  0.227406  NaN
3  0.602941  NaN
4  0.214716  NaN
5  0.534166  7.0
6  0.519983  NaN
7  0.273051  NaN
8  0.454338  NaN
9  0.537210  NaN

df['B'] = (df.groupby(df.B.notnull().cumsum(),as_index=False)['B']
           .transform(lambda x: x.iloc[0]/x.size))

>>> df
          A    B
0  0.899200  1.6
1  0.011443  1.6
2  0.227406  1.6
3  0.602941  1.6
4  0.214716  1.6
5  0.534166  1.4
6  0.519983  1.4
7  0.273051  1.4
8  0.454338  1.4
9  0.537210  1.4

Upvotes: 1
