How to interpolate missing values in Pandas depending upon various conditional approaches

Question

I have a dataframe like the one shown below:

import pandas as pd, numpy as np

df = pd.DataFrame({'Var1':[1.1, 2.2, 3.5, np.NaN, np.NaN, np.NaN, 6.2, 7.9, 8.3, 9.6, 10.7, np.NaN, np.NaN, 11.1, 13.2, 14.7, np.NaN, 15.6, 18.3],
'Var2':[3.1, 6.2, 2.1, 3.6, np.NaN, 6.2, 13.2, 14.7, 7.9, 8.1, np.NaN, 9.2, 10.3, np.NaN, np.NaN,13.5, 14.2]})

For NANs I want to interpolate the missing data. And my strategy for such interpolation is that:

when there is only a single NA, use immediately before and after values, take their MEAN and replace NA with that MEAN
When there are two consecutive NAs, replace both with df['varX'].interpolate(method='polynomial', order=2)
when there are more than two consecutive NAs, replace NAs with df['varX'].interpolate(method='polynomial', order=3/2)

I thought on this: How can it be done?

I found this post a bit helpful, but it is not the one which could help me: Identifying consecutive NaNs with Pandas

jezrael · Accepted Answer

Use:

df = pd.DataFrame({
'Var1':[1.1, 2.2, 3.5, np.NaN, np.NaN, np.NaN, 6.2, 7.9, 8.3, 9.6, 10.7, np.NaN, np.NaN, 11.1, 13.2, 14.7, np.NaN],
'Var2':[3.1, 6.2, 2.1, 3.6, np.NaN, 6.2, 13.2, 14.7, 7.9, 8.1, np.NaN, 9.2, 10.3, np.NaN, np.NaN,13.5, 14.2]})

#count consecutive NaNs
f = lambda x: x.map(x.value_counts())
df1 = df.notna().cumsum().where(df.isna()).apply(f)
print (df1)
    Var1  Var2
0    NaN   NaN
1    NaN   NaN
2    NaN   NaN
3    3.0   NaN
4    3.0   1.0
5    3.0   NaN
6    NaN   NaN
7    NaN   NaN
8    NaN   NaN
9    NaN   NaN
10   NaN   1.0
11   2.0   NaN
12   2.0   NaN
13   NaN   2.0
14   NaN   2.0
15   NaN   NaN
16   1.0   NaN

#compare
m1 = df1.eq(1)
m2 = df1.eq(2)
m3 = df1.gt(2)

#interplations
df11 = df.ffill().add(df.bfill()).div(2)
#if need last/first NaNs replace by previous/last divide 2
#df11 = df.ffill().add(df.bfill(), fill_value=0).div(2)
df22 = df.interpolate(method='polynomial', order=2)

#order need int
df33 = df.interpolate(method='polynomial', order=3)

#set values by np.select
arr = np.select([m1, m2, m3], [df11, df22, df33], default=df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)

print (df)
         Var1       Var2
0    1.100000   3.100000
1    2.200000   6.200000
2    3.500000   2.100000
3    4.175679   3.600000
4    4.492391   4.900000
5    4.987908   6.200000
6    6.200000  13.200000
7    7.900000  14.700000
8    8.300000   7.900000
9    9.600000   8.100000
10  10.700000   8.650000
11  10.857431   9.200000
12  10.323495  10.300000
13  11.100000  11.467171
14  13.200000  12.581168
15  14.700000  13.500000
16        NaN  14.200000

#if need last/first replace by previus 3 values
s1 = df.tail(3).mean().rename(df.index[-1])
s2 = df.head(3).mean().rename(df.index[0])

df = df.fillna(s1).fillna(s2)
print (df)
         Var1       Var2
0    1.100000   3.100000
1    2.200000   6.200000
2    3.500000   2.100000
3    4.175679   3.600000
4    4.492391   4.900000
5    4.987908   6.200000
6    6.200000  13.200000
7    7.900000  14.700000
8    8.300000   7.900000
9    9.600000   8.100000
10  10.700000   8.650000
11  10.857431   9.200000
12  10.323495  10.300000
13  11.100000  11.467171
14  13.200000  12.581168
15  14.700000  13.500000
16  13.950000  14.200000

How to interpolate missing values in Pandas depending upon various conditional approaches

Answers (1)

Related Questions