Giovanni Zanatta

Reputation: 27

Fill NaN values with mean of previous rows?

I have to fill the NaN values of a column in a dataframe with the mean of the previous 3 rows, where values that were filled earlier count toward later means. Here is an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df
col1
0   1.0
1   3.0
2   4.0
3   5.0
4   NaN
5   NaN
6   NaN 
7   7.0

And here is the output I need:

col1
0   1.0
1   3.0
2   4.0
3   5.0
4   4.0
5   4.3
6   4.4 
7   7.0
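
To spell out the arithmetic behind that output: each filled value is the mean of the three values directly above it, and a value filled earlier is treated like any other value. Here is a minimal plain-Python check (the helper name mean_of_last_3 is made up for illustration and is not part of pandas):

```python
def mean_of_last_3(values):
    """Mean of the last three entries of a list (hypothetical helper)."""
    window = values[-3:]
    return sum(window) / len(window)


filled = [1.0, 3.0, 4.0, 5.0]
for _ in range(3):  # three consecutive NaNs to fill
    # Each new value sees the values filled before it
    filled.append(mean_of_last_3(filled))

print([round(v, 2) for v in filled[4:]])
```

The 4.3 and 4.4 shown above are these values rounded to one decimal place.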

I tried df.rolling, but it does not work the way I want when a window contains more than one NaN value:

df.fillna(df.rolling(3, min_periods=1).mean().shift())


col1
0   1.0
1   3.0
2   4.0
3   5.0
4   4.0 # np.nanmean([3, 4, 5])
5   4.5 # np.nanmean([np.nan, 4, 5])
6   5.0 # np.nanmean([np.nan, np.nan, 5])
7   7.0

Can someone help me with that? Thanks in advance!

Upvotes: 1

Views: 1668

Answers (2)

piterbarg

Reputation: 8219

Probably not the most efficient, but it is terse and gets the job done:

from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)

output


    col1
0   1.000000
1   3.000000
2   4.000000
3   5.000000
4   4.000000
5   4.333333
6   4.444444
7   7.000000

We use fillna but require min_periods=3, so each pass fills only those NaNs that have three non-NaN values immediately preceding them, which is typically just the first NaN of each run. We then use reduce to repeat this operation as many times as there are NaNs in col1, which is always enough passes.
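
For readability, the same idea can be wrapped in a named function. This is only a sketch: the function name fill_with_prev_mean and the window parameter are my own additions, and it assumes every column is numeric.

```python
from functools import reduce

import numpy as np
import pandas as pd


def fill_with_prev_mean(frame, window=3):
    """Fill NaNs with the mean of the `window` preceding values, pass by pass."""

    def one_pass(d, _):
        # min_periods=window makes the rolling mean NaN unless the whole
        # window is present, so a pass only fills NaNs whose preceding
        # `window` values are already known.
        return d.fillna(d.rolling(window, min_periods=window).mean().shift())

    # One pass per NaN in the fullest column is an upper bound on the work
    n_passes = int(frame.isna().sum().max())
    return reduce(one_pass, range(n_passes), frame)


df = pd.DataFrame({"col1": [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
result = fill_with_prev_mean(df)
print(result)
```

Note that with min_periods equal to the window size, a NaN within the first three rows has no complete window and stays NaN.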

Upvotes: 3

Nick ODell

Reputation: 25210

I tried two approaches to this problem. One is a loop over the dataframe, and the second is essentially trying the approach you suggest multiple times, to converge on the right answer.

Loop approach

For each row in the dataframe, get the value from col1. Then, take the average of the values from the last 3 rows. (There can be fewer than 3 in this list if we're at the beginning of the dataframe.) If the value is NaN, replace it with the average value. Then, save the value back into the dataframe. If the list of values from the last rows has more than 3 values, remove the oldest one.

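def impute(df2, col_name):
    # Holds up to the last 3 values seen, including values imputed earlier
    last_3 = []
    for index in df2.index:
        val = df2.loc[index, col_name]
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            # No previous rows yet, so there is nothing to average
            imputed = np.nan
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        if len(last_3) > 3:
            # Drop the oldest value to keep the window at 3
            last_3.pop(0)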
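
Much of the loop's cost likely comes from the per-row .loc reads and writes rather than from the averaging itself. As a rough sketch of the same sequential idea on a plain NumPy array (the name impute_sequential and the window parameter are my own, and I have not benchmarked this variant):

```python
import numpy as np
import pandas as pd


def impute_sequential(series, window=3):
    """Fill each NaN with the mean of up to `window` preceding values."""
    values = series.to_numpy(dtype=float, copy=True)
    for i in range(len(values)):
        if np.isnan(values[i]):
            # Preceding values, including any that were filled earlier
            prev = values[max(0, i - window):i]
            prev = prev[~np.isnan(prev)]
            if prev.size:
                values[i] = prev.mean()
    return pd.Series(values, index=series.index, name=series.name)


df = pd.DataFrame({"col1": [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df["col1"] = impute_sequential(df["col1"])
print(df)
```

A NaN in the very first row has nothing before it and is left as NaN.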

Repeated column operation

The core idea here is to notice that in your example of pd.rolling, the first NA replacement value is correct. So, you apply the rolling average, take the first NA value for each run of NA values, and use that number. If you apply this repeatedly, you fill in the first missing value, then the second missing value, then the third. You'll need to run this loop as many times as the longest series of consecutive NA values.

def impute(df2, col_name):
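    # Note: a NaN in the very first row has no previous values to average,
    # so it is never filled and this loop would not terminate.
    while df2[col_name].isna().any():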
        # If there are multiple NA values in a row, identify just
        # the first one
        first_na = df2[col_name].isna().diff() & df2[col_name].isna()
        # Compute mean of previous 3 values
        imputed = df2.rolling(3, min_periods=1).mean().shift()[col_name]
        # Replace NA values with mean if they are very first NA
        # value in run of NA values
        df2.loc[first_na, col_name] = imputed
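
The number of passes this needs equals the longest run of consecutive NA values. If you want to know that number up front, here is one way to compute it (the helper longest_na_run is hypothetical, written only for this illustration):

```python
import numpy as np
import pandas as pd


def longest_na_run(series):
    """Length of the longest run of consecutive NaN values in a Series."""
    is_na = series.isna()
    # A new group starts wherever the NaN flag differs from the row above
    run_id = (is_na != is_na.shift()).cumsum()
    # Summing the boolean flags per group gives each run's NaN count
    # (0 for runs of non-NaN values)
    run_lengths = is_na.groupby(run_id).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0


col = pd.Series([1, 3, 4, 5, np.nan, np.nan, np.nan, 7])
print(longest_na_run(col))
```

For the example column this gives 3, which matches the three passes needed to fill rows 4, 5 and 6.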

Performance comparison

Running both of these on an 80,000-row dataframe, I get the following results:

Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds

Upvotes: 1
