Little Bobby Tables

Reputation: 4742

Pandas ffill limit groups of nan less than limit only

I wish to forward fill some values in my pandas DataFrame, with a limit. However, the fill should only apply to groups of NaNs where the continuous NaN count is less than or equal to the limit. Here is an example.

Create a df with missing data,

import numpy as np
import pandas as pd

df = pd.DataFrame(
        {'val': [1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1]}
)

print(df)

      val
0     1.0
1     1.0
2     NaN
3     NaN
4     2.0
5     3.0
6     NaN
7     NaN
8     NaN
9     NaN
10    1.0
11    1.0

Now if we ffill with limit=2, it fills all NaNs up to 2 steps forward,

print(df.ffill(limit=2))

      val
0     1.0
1     1.0
2     1.0
3     1.0
4     2.0
5     3.0
6     3.0 #
7     3.0 #
8     NaN
9     NaN
10    1.0
11    1.0

The rows marked with # above are filled when I do not want them to be. What I would like is the following,

print(df.ffill(limit=2, dont_fill_any_nan_gaps_bigger_than_limit=True))

      val
0     1.0
1     1.0
2     1.0
3     1.0
4     2.0
5     3.0
6     NaN # 
7     NaN #
8     NaN 
9     NaN 
10    1.0
11    1.0

Of course, this does not need to be a single parameter to ffill; anything with the same effect would work.
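
To make the behaviour completely explicit, here is a brute-force loop that does what I want (purely illustrative, with a made-up name; I am looking for a vectorised / pandas-native way to get the same result):

def ffill_short_gaps_only(s, limit):
    # Forward fill, but leave any run of NaNs longer than `limit` untouched.
    out = list(s)
    i = 0
    while i < len(out):
        if pd.isna(out[i]) and i > 0 and not pd.isna(out[i - 1]):
            j = i
            while j < len(out) and pd.isna(out[j]):
                j += 1
            gap = j - i
            if gap <= limit:
                # short gap: copy the previous value across the whole run
                out[i:j] = [out[i - 1]] * gap
            # long gaps are left untouched; either way, jump past the run
            i = j
        else:
            i += 1
    return pd.Series(out, index=s.index)

print(ffill_short_gaps_only(df['val'], limit=2))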

Upvotes: 3

Views: 1595

Answers (3)

Little Bobby Tables

Reputation: 4742

I have taken inspiration from both excellent answers, as well as this answer from jezrael to my previous question, to apply this to the whole DataFrame.

I did this so that I could treat an entire DataFrame in one go, with each column forward filling its own missing values (subject to my extra constraint).

Set up a DataFrame,

df = pd.DataFrame(
    {'val1': [1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1],
     'val2': [1, 2, np.nan, np.nan, 2, 4, 4, np.nan, np.nan, np.nan, np.nan, 2]}
)

print(df)

    val1    val2
0   1.0     1.0
1   1.0     2.0
2   NaN     NaN
3   NaN     NaN
4   2.0     2.0
5   3.0     4.0
6   NaN #   4.0
7   NaN #   NaN #
8   NaN #   NaN #
9   NaN #   NaN #
10  1.0     NaN #
11  1.0     2.0

Now build the mask and apply ffill,

mask_df = (
    df.isnull()
      .groupby([df.notnull().all(axis=1).cumsum()])
      .rank(method='max') - 1
).gt(2)

df = df.ffill().mask(mask_df)
print(df)

    val1    val2
0   1.0     1.0
1   1.0     2.0
2   1.0     2.0
3   1.0     2.0
4   2.0     2.0
5   3.0     4.0
6   NaN #   4.0
7   NaN #   NaN #
8   NaN #   NaN #
9   NaN #   NaN #
10  1.0     NaN #
11  1.0     2.0

Explanation

We group by a cumulative sum over the fully non-NaN rows, which means consecutive NaN values end up grouped together. Taking the max rank within each group gives the length of the NaN sequence plus one (hence the - 1). Now we just use the mask function as shown in cᴏʟᴅsᴘᴇᴇᴅ's answer.
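
As a small illustration of those intermediate steps (a sketch that recreates the setup DataFrame, since df has already been overwritten by the ffill/mask line above):

tmp = pd.DataFrame(
    {'val1': [1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1],
     'val2': [1, 2, np.nan, np.nan, 2, 4, 4, np.nan, np.nan, np.nan, np.nan, 2]}
)

# a new group starts at every row where all columns are non-NaN
group_key = tmp.notnull().all(axis=1).cumsum()
print(group_key.tolist())
# [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 5]

# within each group the NaN cells get the largest (max) rank, so after
# subtracting 1 the cells belonging to long NaN runs exceed the limit
print((tmp.isnull().groupby([group_key]).rank(method='max') - 1).gt(2))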

Upvotes: 4

jezrael

Reputation: 863351

Create a mask to filter out all rows belonging to NaN runs longer than 2, using groupby with transform('size'), then apply ffill only to the remaining rows by inverting the condition with ~:

a = df['val'].isna()
a = a.ne(a.shift()).cumsum()
m = (a.groupby(a).transform('size') > 2)
df[~m] = df[~m].ffill(limit=2)
print(df)
      val
0     1.0
1     1.0
2     1.0
3     1.0
4     2.0
5     3.0
6     NaN
7     NaN
8     NaN
9     NaN
10    1.0
11    1.0
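
For reference, the intermediate run labels and mask look like this (a quick check, assuming the snippet above has just been run):

print(a.tolist())
# [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5]  -> one label per contiguous run
print(m.tolist())
# True only for rows 6-9 (the NaN run of length 4), which stay unfilled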

Upvotes: 2

cs95

Reputation: 402942

You may ffill unconditionally, provided you un-fill afterwards using pd.Series.mask / pd.Series.where / np.where.

v = df.val.isna()  # or df.val.isnull()
df = df.ffill(limit=2).mask(
     v.groupby(v.ne(v.shift()).cumsum()).transform('size').gt(2)
)

      val
0     1.0
1     1.0
2     1.0
3     1.0
4     2.0
5     3.0
6     NaN
7     NaN
8     NaN
9     NaN
10    1.0
11    1.0
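
The same un-fill step can also be spelled with np.where, as mentioned above; a minimal self-contained sketch (recreating the column rather than reusing the already-filled df):

s = pd.Series([1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1])
v = s.isna()
too_long = v.groupby(v.ne(v.shift()).cumsum()).transform('size').gt(2)
# fill first, then put NaN back wherever the run was longer than the limit
print(pd.Series(np.where(too_long, np.nan, s.ffill(limit=2)), index=s.index))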

Upvotes: 2
