Reputation: 4742
I wish to forward fill some values in my pandas DataFrame, with a limit. However, the fill should only apply to groups of NaNs whose continuous NaN count is less than or equal to the limit. Here is an example.
Create a df with missing data:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'val': [1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1]}
)
print(df)
val
0 1.0
1 1.0
2 NaN
3 NaN
4 2.0
5 3.0
6 NaN
7 NaN
8 NaN
9 NaN
10 1.0
11 1.0
Now if we ffill with limit=2, it fills all NaNs up to 2 steps forward:
print(df.ffill(limit=2))
val
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 3.0
6 3.0 #
7 3.0 #
8 NaN
9 NaN
10 1.0
11 1.0
The rows marked with # above get filled when I do not want them to. What I would like is to get the following:
print(df.ffill(limit=2, dont_fill_any_nan_gaps_bigger_than_limit=True))
val
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 3.0
6 NaN #
7 NaN #
8 NaN
9 NaN
10 1.0
11 1.0
Of course this does not need to be a single parameter in ffill; anything with the same effect would do.
Upvotes: 3
Views: 1595
Reputation: 4742
I have taken inspiration from both excellent answers, as well as this answer from jezrael to my previous question, to apply this to the whole DataFrame. This way I can treat an entire DataFrame in one go, and each column's missing values are ffilled independently (with my extra constraint).
Set up a DataFrame:
df = pd.DataFrame(
{'val1': [1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1],
'val2': [1, 2, np.nan, np.nan, 2, 4, 4, np.nan, np.nan, np.nan, np.nan, 2]}
)
print(df)
val1 val2
0 1.0 1.0
1 1.0 2.0
2 NaN NaN
3 NaN NaN
4 2.0 2.0
5 3.0 4.0
6 NaN # 4.0
7 NaN # NaN #
8 NaN # NaN #
9 NaN # NaN #
10 1.0 NaN #
11 1.0 2.0
Now do mask and ffill:
# Group rows between fully non-NaN rows; within each group, the max rank of the
# null flags reflects the length of the NaN block (see the explanation below).
mask_df = (
    df.isnull()
    .groupby([df.notnull().all(axis=1).cumsum()])
    .rank(method='max') - 1
).gt(2)
# Forward fill everything, then re-blank the positions flagged by the mask.
df = df.ffill().mask(mask_df)
print(df)
val1 val2
0 1.0 1.0
1 1.0 2.0
2 1.0 2.0
3 1.0 2.0
4 2.0 2.0
5 3.0 4.0
6 NaN # 4.0
7 NaN # NaN #
8 NaN # NaN #
9 NaN # NaN #
10 1.0 NaN #
11 1.0 2.0
Explanation
We groupby using a cumulative sum over the fully non-NaN rows. This means each NaN run gets grouped together with the value that precedes it. If we take the max rank of the null flags within each group, we get the length of the NaN sequence + 1. Now we just use the mask function as shown by cᴏʟᴅsᴘᴇᴇᴅ's answer.
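For illustration, here is a small sketch of the intermediates, run on the original two-column df from the setup above (before it was overwritten by the fill); group_key is just an illustrative name for the expression inlined in the snippet:
group_key = df.notnull().all(axis=1).cumsum()  # increments on every fully non-NaN row
print(group_key.tolist())                      # [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 5]
mask_df = (
    df.isnull()
    .groupby([group_key])
    .rank(method='max') - 1
).gt(2)
print(mask_df)  # True on rows 6-9 for val1 and rows 7-10 for val2 (the # positions)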
Upvotes: 4
Reputation: 863351
Create a mask to filter out all runs of NaNs longer than 2, using groupby with transform('size'), and apply ffill only to the remaining rows by inverting the condition with ~:
a = df['val'].isna()                      # flag the missing values
a = a.ne(a.shift()).cumsum()              # label each consecutive run
m = (a.groupby(a).transform('size') > 2)  # runs longer than the limit
df[~m] = df[~m].ffill(limit=2)            # fill only outside those runs
print (df)
val
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 3.0
6 NaN
7 NaN
8 NaN
9 NaN
10 1.0
11 1.0
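To make the run-length step concrete, here are the intermediate values, computed on the question's original single-column df before any filling (a sketch; the names match the snippet above):
a = df['val'].isna()
print(a.ne(a.shift()).cumsum().tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5]
# Run 4 (rows 6-9) has size 4 > 2, so those rows are excluded from the fill.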
Upvotes: 2
Reputation: 402942
You may ffill
unconditionally, provided you un-fill after using pd.Series.mask
/pd.Series.where
/np.where
.
v = df.val.isna()  # or df.val.isnull()
df = df.ffill(limit=2).mask(
    # re-blank positions that sit in a run longer than the limit
    v.groupby(v.ne(v.shift()).cumsum()).transform('size').gt(2)
)
val
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 3.0
6 NaN
7 NaN
8 NaN
9 NaN
10 1.0
11 1.0
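If you need this in more than one place, the same idea can be wrapped in a small helper. The function name ffill_gaps and its signature below are just illustrative, not pandas API; it is a sketch of the mask approach above:
import numpy as np
import pandas as pd

def ffill_gaps(s, limit):
    """Forward fill a Series, but leave NaN runs longer than `limit` untouched."""
    na = s.isna()
    run_id = na.ne(na.shift()).cumsum()             # label consecutive runs
    run_len = na.groupby(run_id).transform('size')  # length of each run
    too_long = na & run_len.gt(limit)               # NaNs inside runs longer than limit
    return s.ffill(limit=limit).mask(too_long)

df = pd.DataFrame(
    {'val': [1, 1, np.nan, np.nan, 2, 3, np.nan, np.nan, np.nan, np.nan, 1, 1]}
)
print(ffill_gaps(df['val'], limit=2))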
Upvotes: 2