Mooncake

Reputation: 23

Pandas interpolate NaNs from zero to next valid value

I am looking for a way to linearly interpolate missing values (NaN) from zero to the next valid value.

E.g.:

     A    B   C   D  E
0  NaN  2.0 NaN NaN  0
1  3.0  4.0 NaN NaN  1
2  NaN  NaN NaN NaN  5
3  NaN  3.0 NaN NaN  4

Given this table, I want the output to look like this:

     A    B   C   D  E
0  NaN  2.0   0   0  0
1  3.0  4.0   0 0.5  1
2  NaN  NaN NaN NaN  5
3  NaN  3.0   0   2  4

I've tried using fillna to set only the NaN adjacent to a valid value to 0 and then linearly interpolate the whole DataFrame. The problem I'm facing is that specifying a value and a limit with fillna doesn't act on each run of consecutive NaNs; instead, the limit caps the total number of entries that get filled.
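The limit behavior described above can be reproduced on a single Series. This is only a minimal sketch with made-up values (the Series `s` is not part of the data above), showing that `limit=1` stops after one fill in total instead of filling one NaN per streak:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two separate NaN streaks
s = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, np.nan])

# limit=1 caps the total number of fills, so only the very first NaN becomes 0
filled = s.fillna(0, limit=1)
print(filled)
```

The second streak is left untouched, which is why this approach cannot place a 0 at the start of every gap.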

If possible, please suggest only solutions that don't iterate over each row manually, since I'm working with large DataFrames.

Thanks in advance.

Upvotes: 2

Views: 2039

Answers (1)

ALollz

Reputation: 59579

Here's a method that fills in 0 for the first NaN after a valid number and then interpolates row-wise. I added extra rows at the end to illustrate the behavior for multiple fills in the same row, fills of only one value, and rows that end in NaN streaks.

Sample Data

     A    B    C    D    E
0  NaN  2.0  NaN  NaN  0.0
1  3.0  4.0  NaN  NaN  1.0
2  NaN  NaN  NaN  NaN  5.0
3  NaN  3.0  NaN  NaN  4.0
4  3.0  NaN  7.0  NaN  5.0
5  NaN  4.0  7.0  NaN  6.0
6  NaN  4.0  7.0  NaN  NaN
7  5.0  NaN  5.0  NaN  NaN

Code

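# Flag NaNs that come after a valid value in the row; the diff marks the
# first NaN of each streak with 1 and the first valid value after it with -1
m = (df.notnull().cummax(axis=1) & df.isnull()).astype(int).diff(axis=1).fillna(0)
# Keep a 1 only if a -1 follows later in the row, then turn it into the 0 to insert
update = m.where(m.eq(1) & m.loc[:, ::-1].cummin(axis=1).eq(-1)).replace(1, 0)

df.update(update)  # Add in 0s (in place; NaNs in `update` are ignored)

df = df.interpolate(axis=1, limit_area='inside')  # Row-wise, only between valid values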

     A    B    C    D    E
0  NaN  2.0  0.0  0.0  0.0
1  3.0  4.0  0.0  0.5  1.0
2  NaN  NaN  NaN  NaN  5.0
3  NaN  3.0  0.0  2.0  4.0
4  3.0  0.0  7.0  0.0  5.0
5  NaN  4.0  7.0  0.0  6.0
6  NaN  4.0  7.0  NaN  NaN
7  5.0  0.0  5.0  NaN  NaN
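For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample data and wraps the same steps in a function. The function name `zero_then_interpolate` and the intermediate variable names are my own additions for illustration; the logic is meant to mirror the code above, working on a copy so the input is left unchanged.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [
        [nan, 2, nan, nan, 0],
        [3, 4, nan, nan, 1],
        [nan, nan, nan, nan, 5],
        [nan, 3, nan, nan, 4],
        [3, nan, 7, nan, 5],
        [nan, 4, 7, nan, 6],
        [nan, 4, 7, nan, nan],
        [5, nan, 5, nan, nan],
    ],
    columns=list("ABCDE"),
    dtype=float,
)


def zero_then_interpolate(frame):
    """Put 0 at the start of each interior NaN streak, then interpolate row-wise."""
    out = frame.copy()
    # True for NaNs that appear after at least one valid value in the row
    after_valid = out.notnull().cummax(axis=1) & out.isnull()
    # 1 = first NaN of a streak, -1 = first valid value after a streak
    m = after_valid.astype(int).diff(axis=1).fillna(0)
    # True where a -1 occurs somewhere to the right, i.e. the streak is closed
    has_later_value = m.loc[:, ::-1].cummin(axis=1).eq(-1)
    update = m.where(m.eq(1) & has_later_value).replace(1, 0)
    out.update(update)  # NaNs in `update` leave `out` untouched
    return out.interpolate(axis=1, limit_area="inside")


result = zero_then_interpolate(df)
print(result)
```

Printing `result` should reproduce the table above, assuming all columns are float as they are here.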

How it works:

(df.notnull().cummax(axis=1) & df.isnull())  # True for streaks of null after non-null
#       A      B      C      D      E
#0  False  False   True   True  False
#1  False  False   True   True  False
#2  False  False  False  False  False
#3  False  False   True   True  False
#4  False   True  False   True  False
#5  False  False  False   True  False
#6  False  False  False   True   True
#7  False   True  False   True   True

# Taking the diff then allows you to find only the first NaN after any non-null.
# I.e. flagged by `1`
(df.notnull().cummax(axis=1) & df.isnull()).astype(int).diff(axis=1).fillna(0)
#     A    B    C    D    E
#0  0.0  0.0  1.0  0.0 -1.0
#1  0.0  0.0  1.0  0.0 -1.0
#2  0.0  0.0  0.0  0.0  0.0
#3  0.0  0.0  1.0  0.0 -1.0
#4  0.0  1.0 -1.0  1.0 -1.0
#5  0.0  0.0  0.0  1.0 -1.0
#6  0.0  0.0  0.0  1.0  0.0
#7  0.0  1.0 -1.0  1.0  0.0

# The update DataFrame has the same index and columns as df, with 0 where a
# fill is needed and NaN everywhere else.
# The reversed cummin ensures fills only if there's a non-null value after the 0.
m.where(m.eq(1) & m.loc[:, ::-1].cummin(axis=1).eq(-1)).replace(1, 0)
#    A    B    C    D   E
#0 NaN  NaN  0.0  NaN NaN
#1 NaN  NaN  0.0  NaN NaN
#2 NaN  NaN  NaN  NaN NaN
#3 NaN  NaN  0.0  NaN NaN
#4 NaN  0.0  NaN  0.0 NaN
#5 NaN  NaN  NaN  0.0 NaN
#6 NaN  NaN  NaN  NaN NaN
#7 NaN  0.0  NaN  NaN NaN
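The reversed cummin is probably the least obvious step, so here is a small sketch that isolates it on row 7 of the diff table. The one-row frame and the names `rev_cummin` and `keep` are introduced only for this illustration.

```python
import pandas as pd

# Row 7 of the diff table: streak starts at B and D, a valid value follows only the first
m = pd.DataFrame([[0.0, 1.0, -1.0, 1.0, 0.0]], columns=list("ABCDE"), index=[7])

# Running minimum taken from right to left, then put back into A..E order
rev_cummin = m.loc[:, ::-1].cummin(axis=1).loc[:, ::-1]

# A streak start (1) is kept only if a -1 occurs somewhere to its right
keep = m.eq(1) & rev_cummin.eq(-1)

print(rev_cummin)
print(keep)
```

Column B is kept because the -1 at C lies to its right, while D is dropped because nothing but NaN follows it, which matches row 7 of the update table.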

Upvotes: 3
