Reputation: 27
I need to fill the NaN values of a column in a DataFrame with the mean of the previous 3 values, where values that were already filled in count toward later means. Here is an example:
df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 NaN
5 NaN
6 NaN
7 7.0
And here is the output I need:
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0
5 4.3
6 4.4
7 7.0
I tried rolling, but it does not work the way I want when a window contains more than one NaN value:
df.fillna(df.rolling(3, min_periods=1).mean().shift())
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.nan, 4, 5])
6 5.0 # np.nanmean([np.nan, np.nan, 5])
7 7.0
Can someone help me with that? Thanks in advance!
Upvotes: 1
Views: 1668
Reputation: 8219
Probably not the most efficient, but it is terse and gets the job done:
from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)
output
col1
0 1.000000
1 3.000000
2 4.000000
3 5.000000
4 4.000000
5 4.333333
6 4.444444
7 7.000000
We basically use fillna but require min_periods=3, meaning each pass fills only those NaNs that have three non-NaN numbers immediately preceding them. Then we use reduce to repeat this operation as many times as there are NaNs in col1.
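The same idea can also be written as an explicit loop that stops as soon as a pass fills nothing, so it does not need one pass per NaN when the NaNs come in several short runs. This is only a sketch: fill_with_rolling_mean is a hypothetical helper name, not a pandas function, and the window parameter is an assumption added for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_rolling_mean(df, window=3):
    """Repeatedly fill NaNs that have `window` non-NaN values right before them."""
    out = df.copy()
    while True:
        remaining = out.isna().sum().sum()
        if remaining == 0:
            break
        # min_periods=window: a mean exists only when the whole window is known
        out = out.fillna(out.rolling(window, min_periods=window).mean().shift())
        if out.isna().sum().sum() == remaining:
            break  # no progress, e.g. NaNs within the first `window` rows
    return out


df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
result = fill_with_rolling_mean(df)
print(result)
```

On the example from the question this needs three passes, one per NaN in the run, just like the reduce version.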
Upvotes: 3
Reputation: 25210
I tried two approaches to this problem. One is a loop over the dataframe, and the second is essentially trying the approach you suggest multiple times, to converge on the right answer.
For each row in the dataframe, get the value from col1. Then, take the average of the values from the last 3 rows. (There can be fewer than 3 in this list, if we're at the beginning of the dataframe.) If the value is NaN, replace it with the average value. Then, save the value back into the dataframe. If the list of values from the last rows has more than 3 values, then remove the oldest one.
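def impute(df2, col_name):
    # Holds up to the 3 most recent values, including imputed ones
    last_3 = []
    for index in df2.index:
        val = df2.loc[index, col_name]
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            # Nothing to average yet, so a leading NaN stays NaN
            imputed = np.nan
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        if len(last_3) > 3:
            # Drop the oldest value
            last_3.pop(0)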
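Much of the cost of a loop like this probably comes from the per-row .loc reads and writes. Below is a sketch of the same left-to-right logic on a plain NumPy array. impute_array is a hypothetical name, and any speed-up is an expectation, not something measured here.

```python
import numpy as np
import pandas as pd


def impute_array(series, window=3):
    """Fill each NaN with the mean of up to `window` preceding (already filled) values."""
    values = series.to_numpy(dtype=float, copy=True)
    for i in range(1, len(values)):
        if np.isnan(values[i]):
            prev = values[max(0, i - window):i]
            # Skip windows that are entirely NaN (leading NaNs stay NaN)
            if not np.isnan(prev).all():
                values[i] = np.nanmean(prev)
    return pd.Series(values, index=series.index, name=series.name)


df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df['col1'] = impute_array(df['col1'])
print(df)
```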
The core idea of the second approach is to notice that in your rolling example, the first NA replacement value is correct. So, you apply the rolling average, take the first NA value for each run of NA values, and use that number. If you apply this repeatedly, you fill in the first missing value, then the second missing value, then the third. You'll need to run this loop as many times as the length of the longest run of consecutive NA values.
def impute(df2, col_name):
    while df2[col_name].isna().any():
        is_na = df2[col_name].isna()
        # If there are multiple NA values in a row, identify just
        # the first one
        first_na = is_na & ~is_na.shift(fill_value=False)
        # Compute mean of previous 3 values
        imputed = df2[col_name].rolling(3, min_periods=1).mean().shift()
        # Replace NA values with mean if they are very first NA
        # value in run of NA values
        df2.loc[first_na, col_name] = imputed
        # Stop if nothing could be filled (e.g. the column starts with NA)
        if df2[col_name].isna().sum() == is_na.sum():
            break
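One case worth checking before relying on the repeated passes: with min_periods=1, the two approaches can disagree when two runs of NA are separated by fewer than three known values, because the first NA of the later run is filled before the earlier run has been completed. The sketch below compares both on a small made-up series. sequential_fill and repeated_fill are hypothetical names written for this comparison only.

```python
import numpy as np
import pandas as pd


def sequential_fill(values, window=3):
    """Reference: fill NaNs left to right from up to `window` previous values."""
    out = np.array(values, dtype=float)
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            prev = out[max(0, i - window):i]
            if not np.isnan(prev).all():
                out[i] = np.nanmean(prev)
    return out


def repeated_fill(values, window=3):
    """Repeated passes: each pass fills the first NaN of every run."""
    s = pd.Series(values, dtype=float)
    while s.isna().any():
        is_na = s.isna()
        first_na = is_na & ~is_na.shift(fill_value=False)
        imputed = s.rolling(window, min_periods=1).mean().shift()
        s[first_na] = imputed[first_na]
        if s.isna().sum() == is_na.sum():
            break  # nothing left that can be filled
    return s.to_numpy()


data = [1, np.nan, np.nan, 5, np.nan]
seq = sequential_fill(data)   # last value averages the two filled values and 5
rep = repeated_fill(data)     # last value is filled in the first pass, from 5 alone
print(seq)
print(rep)
```

If runs of NA in your data are always separated by at least three known values, as in the example from the question, both give the same result.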
Running both of these on an 80,000-row dataframe, I get the following results:
Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds
Upvotes: 1