Reputation: 9121
I have a dataset that I am trying to fill in blank cells. I am doing this by grouping (using groupby
) the data on two of the columns and then propagating non-NaN (i.e. non-blank) data forward (ffill
) and backward (bfill
) within those groups. For some reason, rows that should be in separate groups are being filled when they shouldn't be. For example, in column "B", row "d" shouldn't be changed because it should be in its own group with no other rows to take data from. Same thing for "g". Yet, both these rows are filled in based on the data around them. Why is that and how can I get my desired result?
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
"B": ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'c', 'e', 'e', 'g', 'h', 'h'],
"C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
"D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
"E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0, np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0]})
cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]
df[cols_to_fill] = df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill().bfill()
Sample dataset:
A B C D E
1 a 5.0 1.0 NaN
1 a NaN 1.0 6.0
2 b 4.0 NaN NaN
2 b 4.0 2.0 3.0
2 b NaN NaN NaN
3 c 9.0 NaN NaN
3 c NaN NaN 7.0
3 d NaN NaN NaN
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e NaN NaN NaN
5 g 2.0 9.0 NaN
6 h NaN NaN NaN
6 h 3.0 3.0 0.0
Expected outcome:
A B C D E
1 a 5.0 1.0 6.0
1 a 5.0 1.0 6.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
3 c 9.0 3.0 7.0
3 c 9.0 3.0 7.0
3 d NaN NaN NaN
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e 8.0 2.0 2.0
5 g 2.0 9.0 NaN
6 h 3.0 3.0 0.0
6 h 3.0 3.0 0.0
Actual outcome:
A B C D E
1 a 5.0 1.0 6.0
1 a 5.0 1.0 6.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
3 c 9.0 3.0 7.0
3 c 9.0 3.0 7.0
3 d 9.0 3.0 7.0
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e 8.0 2.0 2.0
5 g 2.0 9.0 0.0
6 h 3.0 3.0 0.0
6 h 3.0 3.0 0.0
Edit:
The accepted solution works for this particular case, but does not work properly if any of the columns being grouped on are missing values. The best solution, as of now, can be found in the accepted answer in this post: Pandas deleting cells when filling groups with NaN in groupby column.
Upvotes: 2
Views: 77
Reputation: 863531
Here, it is necessary to use a lambda function to call ffill
and bfill
per groups:
df[cols_to_fill] = df.groupby(cols_to_groupby)[cols_to_fill].apply(lambda x: x.ffill().bfill())
Alternative:
df[cols_to_fill] = df.groupby(cols_to_groupby)[cols_to_fill].transform(lambda x: x.ffill().bfill())
print(df)
A B C D E
0 1 a 5.0 1.0 6.0
1 1 a 5.0 1.0 6.0
2 2 b 4.0 2.0 3.0
3 2 b 4.0 2.0 3.0
4 2 b 4.0 2.0 3.0
5 3 c 9.0 3.0 7.0
6 3 c 9.0 3.0 7.0
7 3 d NaN NaN NaN
8 3 c 9.0 3.0 7.0
9 4 e 8.0 2.0 2.0
10 4 e 8.0 2.0 2.0
11 5 g 2.0 9.0 NaN
12 6 h 3.0 3.0 0.0
13 6 h 3.0 3.0 0.0
Your solution is forward filling values per groups, but back filling values outside of the groups:
print(df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill())
C D E
0 5.0 1.0 NaN
1 5.0 1.0 6.0
2 4.0 NaN NaN
3 4.0 2.0 3.0
4 4.0 2.0 3.0
5 9.0 NaN NaN
6 9.0 NaN 7.0
7 NaN NaN NaN
8 9.0 3.0 7.0
9 8.0 2.0 2.0
10 8.0 2.0 2.0
11 2.0 9.0 NaN
12 NaN NaN NaN
13 3.0 3.0 0.0
print(df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill().bfill())
C D E
0 5.0 1.0 6.0
1 5.0 1.0 6.0
2 4.0 2.0 3.0
3 4.0 2.0 3.0
4 4.0 2.0 3.0
5 9.0 3.0 7.0
6 9.0 3.0 7.0
7 9.0 3.0 7.0
8 9.0 3.0 7.0
9 8.0 2.0 2.0
10 8.0 2.0 2.0
11 2.0 9.0 0.0
12 3.0 3.0 0.0
13 3.0 3.0 0.0
Upvotes: 1