NeStack
NeStack

Reputation: 2014

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that contains also nan values. I want to substitute the nans with interpolated values with df.interpolate, but only if the length of the sequence of nan values is =<N. As an example, let's assume that I choose N = 2 (so I want to fill in sequences of nans if they are up to 2 nans long) and I have a dataframe with

print(df)
A   B   C
1   1   1
nan nan 2
nan nan 3
nan 4   nan
5   5   5

In such a case I want to apply a function on df that only the nan sequences with length N<=2 get filled, but the larger sequences get untouched, resulting in my desired output of

print(df)
A   B   C
1   1   1
nan 2   2
nan 3   3
nan 4   4
5   5   5

Note that I am aware of the option of limit=N inside df.interpolate, but it doesn't fulfil what I want, because it would fill any length of nan sequence, just limit the filling to a the first 3 nans resulting in the undesired output

print(df)
A   B   C
1   1   1
2   2   2
3   3   3
nan 4   4
5   5   5

So do you know of a function/ do you know how to construct a code that results in my desired output? Tnx

Upvotes: 2

Views: 1130

Answers (3)

d.b
d.b

Reputation: 32548

You can perform run length encoding and identify the runs of NaN that are shorter than or equal to two elements for each columns. One way to do that is to use get_id from package pdrle (disclaimer: I wrote it).

import pdrle


chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
#      A    B    C
# 0  1.0  1.0  1.0
# 1  NaN  2.0  2.0
# 2  NaN  3.0  3.0
# 3  NaN  4.0  4.0
# 4  5.0  5.0  5.0

Upvotes: 3

Akshay Sehgal
Akshay Sehgal

Reputation: 19322

You can try the following -

n=2
cols = df.columns[df.isna().sum()<=n]
df[cols]  = df[cols].interpolate()
df
     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0

df.columns[df.isna().sum()<=n] filters the columns based on your condition. Then, you simply overwrite the columns after interpolation.

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195438

Try:

N = 2
df_interpolated = df.interpolate()

for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]

print(df_interpolated)

Prints:

     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0

Trying with different df:

     A    B    C
0  1.0  1.0  1.0
1  NaN  NaN  2.0
2  NaN  NaN  3.0
3  NaN  4.0  NaN
4  5.0  5.0  5.0
5  NaN  5.0  NaN
6  NaN  5.0  NaN
7  8.0  5.0  NaN

produces:

     A    B    C
0  1.0  1.0  1.0
1  NaN  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  5.0  5.0  5.0
5  6.0  5.0  NaN
6  7.0  5.0  NaN
7  8.0  5.0  NaN

Upvotes: 2

Related Questions