Filling in Pandas DataFrame missing data based on mulit-column GroupBy

Question

I have a dataset that I am trying to fill in blank cells. I am doing this by grouping (using groupby) the data on two of the columns and then propagating non-NaN (i.e. non-blank) data forward (ffill) and backward (bfill) within those groups. For some reason, rows that should be in separate groups are being filled when they shouldn't be. For example, in column "B", row "d" shouldn't be changed because it should be in its own group with no other rows to take data from. Same thing for "g". Yet, both these rows are filled in based on the data around them. Why is that and how can I get my desired result?

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'c', 'e', 'e', 'g', 'h', 'h'],
                   "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
                   "D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
                   "E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0, np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0]})

cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]

df[cols_to_fill] = df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill().bfill()

Sample dataset:

A B   C   D   E
1 a 5.0 1.0 NaN
1 a NaN 1.0 6.0
2 b 4.0 NaN NaN
2 b 4.0 2.0 3.0
2 b NaN NaN NaN
3 c 9.0 NaN NaN
3 c NaN NaN 7.0
3 d NaN NaN NaN
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e NaN NaN NaN
5 g 2.0 9.0 NaN
6 h NaN NaN NaN
6 h 3.0 3.0 0.0

Expected outcome:

A B   C   D   E
1 a 5.0 1.0 6.0
1 a 5.0 1.0 6.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
3 c 9.0 3.0 7.0
3 c 9.0 3.0 7.0
3 d NaN NaN NaN
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e 8.0 2.0 2.0
5 g 2.0 9.0 NaN
6 h 3.0 3.0 0.0
6 h 3.0 3.0 0.0

Actual outcome:

A B   C   D   E
1 a 5.0 1.0 6.0
1 a 5.0 1.0 6.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
3 c 9.0 3.0 7.0
3 c 9.0 3.0 7.0
3 d 9.0 3.0 7.0
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e 8.0 2.0 2.0
5 g 2.0 9.0 0.0
6 h 3.0 3.0 0.0
6 h 3.0 3.0 0.0

Edit:

The accepted solution works for this particular case, but does not work properly if any of the columns being grouped on are missing values. The best solution, as of now, can be found in the accepted answer in this post: Pandas deleting cells when filling groups with NaN in groupby column.

jezrael · Accepted Answer

Here, it is necessary to use a lambda function to call ffill and bfill per groups:

df[cols_to_fill] = df.groupby(cols_to_groupby)[cols_to_fill].apply(lambda x: x.ffill().bfill())

Alternative:

df[cols_to_fill] = df.groupby(cols_to_groupby)[cols_to_fill].transform(lambda x: x.ffill().bfill())

print(df)
    A  B    C    D    E
0   1  a  5.0  1.0  6.0
1   1  a  5.0  1.0  6.0
2   2  b  4.0  2.0  3.0
3   2  b  4.0  2.0  3.0
4   2  b  4.0  2.0  3.0
5   3  c  9.0  3.0  7.0
6   3  c  9.0  3.0  7.0
7   3  d  NaN  NaN  NaN
8   3  c  9.0  3.0  7.0
9   4  e  8.0  2.0  2.0
10  4  e  8.0  2.0  2.0
11  5  g  2.0  9.0  NaN
12  6  h  3.0  3.0  0.0
13  6  h  3.0  3.0  0.0

Your solution is forward filling values per groups, but back filling values outside of the groups:

print(df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill())
      C    D    E
0   5.0  1.0  NaN
1   5.0  1.0  6.0
2   4.0  NaN  NaN
3   4.0  2.0  3.0
4   4.0  2.0  3.0
5   9.0  NaN  NaN
6   9.0  NaN  7.0
7   NaN  NaN  NaN
8   9.0  3.0  7.0
9   8.0  2.0  2.0
10  8.0  2.0  2.0
11  2.0  9.0  NaN
12  NaN  NaN  NaN
13  3.0  3.0  0.0

print(df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill().bfill())
      C    D    E
0   5.0  1.0  6.0
1   5.0  1.0  6.0
2   4.0  2.0  3.0
3   4.0  2.0  3.0
4   4.0  2.0  3.0
5   9.0  3.0  7.0
6   9.0  3.0  7.0
7   9.0  3.0  7.0
8   9.0  3.0  7.0
9   8.0  2.0  2.0
10  8.0  2.0  2.0
11  2.0  9.0  0.0
12  3.0  3.0  0.0
13  3.0  3.0  0.0

Filling in Pandas DataFrame missing data based on mulit-column GroupBy

Answers (1)

Related Questions