Reputation: 23
I am looking for a way to linear interpolate missing values (NaN) from zero to the next valid value.
E.g.:
A B C D E
0 NaN 2.0 NaN NaN 0
1 3.0 4.0 NaN NaN 1
2 NaN NaN NaN NaN 5
3 NaN 3.0 NaN NaN 4
Given this table, i want the output to look like this:
A B C D E
0 NaN 2.0 0 0 0
1 3.0 4.0 0 0.5 1
2 NaN NaN NaN NaN 5
3 NaN 3.0 0 2 4
I've tried using fillna to fill only the next NaN to a valid value to 0 and to then linear interpolate the whole dataframe. The problem I'm facing here is that specifying a value and a limit with fillna won't affect consecutive NaNs, but limit the total amount of columns to be filled.
If possible please only suggest solutions without iterating over each row manually since I'm working with large dataframes.
Thanks in advance.
Upvotes: 2
Views: 2039
Reputation: 59579
Here's a method that will work to replace 0 for the first NaN
after a valid number and then will interpolate row-wise. I added extra rows in the end to illustrate the behavior for multiple fillings on the same row, fillings of only one value, or rows that end in NaN streaks.
A B C D E
0 NaN 2.0 NaN NaN 0
1 3.0 4.0 NaN NaN 1
2 NaN NaN NaN NaN 5
3 NaN 3.0 NaN NaN 4
4 3 NaN 7 NaN 5
5 NaN 4 7 NaN 6
6 NaN 4 7 NaN NaN
7 5 NaN 5 NaN NaN
m = (df.notnull().cummax(axis=1) & df.isnull()).astype(int).diff(axis=1).fillna(0)
update = m.where(m.eq(1) & m.loc[:, ::-1].cummin(axis=1).eq(-1)).replace(1, 0)
df.update(update) # Add in 0s
df = df.interpolate(axis=1, limit_area='inside')
A B C D E
0 NaN 2.0 0.0 0.0 0.0
1 3.0 4.0 0.0 0.5 1.0
2 NaN NaN NaN NaN 5.0
3 NaN 3.0 0.0 2.0 4.0
4 3.0 0.0 7.0 0.0 5.0
5 NaN 4.0 7.0 0.0 6.0
6 NaN 4.0 7.0 NaN NaN
7 5.0 0.0 5.0 NaN NaN
How it works:
(df.notnull().cummax(1) & df.isnull()) # True for streaks of null after non-null
# A B C D E
#0 False False True True False
#1 False False True True False
#2 False False False False False
#3 False False True True False
#4 False True False True False
#5 False False False True False
#6 False False False True True
#7 False True False True True
# Taking the diff then allows you to find only the first NaN after any non-null.
# I.e. flagged by `1`
(df.notnull().cummax(1) & df.isnull()).astype(int).diff(axis=1).fillna(0)
# A B C D E
#0 0.0 0.0 1.0 0.0 -1.0
#1 0.0 0.0 1.0 0.0 -1.0
#2 0.0 0.0 0.0 0.0 0.0
#3 0.0 0.0 1.0 0.0 -1.0
#4 0.0 1.0 -1.0 1.0 -1.0
#5 0.0 0.0 0.0 1.0 -1.0
#6 0.0 0.0 0.0 1.0 0.0
#7 0.0 1.0 -1.0 1.0 0.0
# The update DataFrame is a like-indexed DF with 0s where they get filled.
# The reversed cummin ensures fills only if there's a non-null value after the 0.
m.where(m.eq(1) & m.loc[:, ::-1].cummin(1).eq(-1)).replace(1, 0)
# A B C D E
#0 NaN NaN 0.0 NaN NaN
#1 NaN NaN 0.0 NaN NaN
#2 NaN NaN NaN NaN NaN
#3 NaN NaN 0.0 NaN NaN
#4 NaN 0.0 NaN 0.0 NaN
#5 NaN NaN NaN 0.0 NaN
#6 NaN NaN NaN NaN NaN
#7 NaN 0.0 NaN NaN NaN
Upvotes: 3