Reputation: 783
I have a dataframe like the one shown below:
import pandas as pd, numpy as np
df = pd.DataFrame({'Var1':[1.1, 2.2, 3.5, np.NaN, np.NaN, np.NaN, 6.2, 7.9, 8.3, 9.6, 10.7, np.NaN, np.NaN, 11.1, 13.2, 14.7, np.NaN, 15.6, 18.3],
'Var2':[3.1, 6.2, 2.1, 3.6, np.NaN, 6.2, 13.2, 14.7, 7.9, 8.1, np.NaN, 9.2, 10.3, np.NaN, np.NaN,13.5, 14.2]})
For NANs I want to interpolate the missing data. And my strategy for such interpolation is that:
I thought on this: How can it be done?
I found this post a bit helpful, but it is not the one which could help me: Identifying consecutive NaNs with Pandas
Upvotes: 1
Views: 1406
Reputation: 862511
Use:
df = pd.DataFrame({
'Var1':[1.1, 2.2, 3.5, np.NaN, np.NaN, np.NaN, 6.2, 7.9, 8.3, 9.6, 10.7, np.NaN, np.NaN, 11.1, 13.2, 14.7, np.NaN],
'Var2':[3.1, 6.2, 2.1, 3.6, np.NaN, 6.2, 13.2, 14.7, 7.9, 8.1, np.NaN, 9.2, 10.3, np.NaN, np.NaN,13.5, 14.2]})
#count consecutive NaNs
f = lambda x: x.map(x.value_counts())
df1 = df.notna().cumsum().where(df.isna()).apply(f)
print (df1)
Var1 Var2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 3.0 NaN
4 3.0 1.0
5 3.0 NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN 1.0
11 2.0 NaN
12 2.0 NaN
13 NaN 2.0
14 NaN 2.0
15 NaN NaN
16 1.0 NaN
#compare
m1 = df1.eq(1)
m2 = df1.eq(2)
m3 = df1.gt(2)
#interplations
df11 = df.ffill().add(df.bfill()).div(2)
#if need last/first NaNs replace by previous/last divide 2
#df11 = df.ffill().add(df.bfill(), fill_value=0).div(2)
df22 = df.interpolate(method='polynomial', order=2)
#order need int
df33 = df.interpolate(method='polynomial', order=3)
#set values by np.select
arr = np.select([m1, m2, m3], [df11, df22, df33], default=df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
Var1 Var2
0 1.100000 3.100000
1 2.200000 6.200000
2 3.500000 2.100000
3 4.175679 3.600000
4 4.492391 4.900000
5 4.987908 6.200000
6 6.200000 13.200000
7 7.900000 14.700000
8 8.300000 7.900000
9 9.600000 8.100000
10 10.700000 8.650000
11 10.857431 9.200000
12 10.323495 10.300000
13 11.100000 11.467171
14 13.200000 12.581168
15 14.700000 13.500000
16 NaN 14.200000
#if need last/first replace by previus 3 values
s1 = df.tail(3).mean().rename(df.index[-1])
s2 = df.head(3).mean().rename(df.index[0])
df = df.fillna(s1).fillna(s2)
print (df)
Var1 Var2
0 1.100000 3.100000
1 2.200000 6.200000
2 3.500000 2.100000
3 4.175679 3.600000
4 4.492391 4.900000
5 4.987908 6.200000
6 6.200000 13.200000
7 7.900000 14.700000
8 8.300000 7.900000
9 9.600000 8.100000
10 10.700000 8.650000
11 10.857431 9.200000
12 10.323495 10.300000
13 11.100000 11.467171
14 13.200000 12.581168
15 14.700000 13.500000
16 13.950000 14.200000
Upvotes: 1