jared
jared

Reputation: 9121

Filling in Pandas DataFrame missing data based on mulit-column GroupBy

I have a dataset that I am trying to fill in blank cells. I am doing this by grouping (using groupby) the data on two of the columns and then propagating non-NaN (i.e. non-blank) data forward (ffill) and backward (bfill) within those groups. For some reason, rows that should be in separate groups are being filled when they shouldn't be. For example, in column "B", row "d" shouldn't be changed because it should be in its own group with no other rows to take data from. Same thing for "g". Yet, both these rows are filled in based on the data around them. Why is that and how can I get my desired result?

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'c', 'e', 'e', 'g', 'h', 'h'],
                   "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
                   "D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
                   "E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0, np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0]})

cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]

df[cols_to_fill] = df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill().bfill()

Sample dataset:

A B   C   D   E
1 a 5.0 1.0 NaN
1 a NaN 1.0 6.0
2 b 4.0 NaN NaN
2 b 4.0 2.0 3.0
2 b NaN NaN NaN
3 c 9.0 NaN NaN
3 c NaN NaN 7.0
3 d NaN NaN NaN
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e NaN NaN NaN
5 g 2.0 9.0 NaN
6 h NaN NaN NaN
6 h 3.0 3.0 0.0

Expected outcome:

A B   C   D   E
1 a 5.0 1.0 6.0
1 a 5.0 1.0 6.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
3 c 9.0 3.0 7.0
3 c 9.0 3.0 7.0
3 d NaN NaN NaN
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e 8.0 2.0 2.0
5 g 2.0 9.0 NaN
6 h 3.0 3.0 0.0
6 h 3.0 3.0 0.0

Actual outcome:

A B   C   D   E
1 a 5.0 1.0 6.0
1 a 5.0 1.0 6.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
2 b 4.0 2.0 3.0
3 c 9.0 3.0 7.0
3 c 9.0 3.0 7.0
3 d 9.0 3.0 7.0
3 c 9.0 3.0 7.0
4 e 8.0 2.0 2.0
4 e 8.0 2.0 2.0
5 g 2.0 9.0 0.0
6 h 3.0 3.0 0.0
6 h 3.0 3.0 0.0

Edit:

The accepted solution works for this particular case, but does not work properly if any of the columns being grouped on are missing values. The best solution, as of now, can be found in the accepted answer in this post: Pandas deleting cells when filling groups with NaN in groupby column.

Upvotes: 2

Views: 77

Answers (1)

jezrael
jezrael

Reputation: 863531

Here, it is necessary to use a lambda function to call ffill and bfill per groups:

df[cols_to_fill] = df.groupby(cols_to_groupby)[cols_to_fill].apply(lambda x: x.ffill().bfill())

Alternative:

df[cols_to_fill] = df.groupby(cols_to_groupby)[cols_to_fill].transform(lambda x: x.ffill().bfill())

print(df)
    A  B    C    D    E
0   1  a  5.0  1.0  6.0
1   1  a  5.0  1.0  6.0
2   2  b  4.0  2.0  3.0
3   2  b  4.0  2.0  3.0
4   2  b  4.0  2.0  3.0
5   3  c  9.0  3.0  7.0
6   3  c  9.0  3.0  7.0
7   3  d  NaN  NaN  NaN
8   3  c  9.0  3.0  7.0
9   4  e  8.0  2.0  2.0
10  4  e  8.0  2.0  2.0
11  5  g  2.0  9.0  NaN
12  6  h  3.0  3.0  0.0
13  6  h  3.0  3.0  0.0

Your solution is forward filling values per groups, but back filling values outside of the groups:

print(df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill())
      C    D    E
0   5.0  1.0  NaN
1   5.0  1.0  6.0
2   4.0  NaN  NaN
3   4.0  2.0  3.0
4   4.0  2.0  3.0
5   9.0  NaN  NaN
6   9.0  NaN  7.0
7   NaN  NaN  NaN
8   9.0  3.0  7.0
9   8.0  2.0  2.0
10  8.0  2.0  2.0
11  2.0  9.0  NaN
12  NaN  NaN  NaN
13  3.0  3.0  0.0

print(df[cols_to_fill + cols_to_groupby].groupby(cols_to_groupby).ffill().bfill())
      C    D    E
0   5.0  1.0  6.0
1   5.0  1.0  6.0
2   4.0  2.0  3.0
3   4.0  2.0  3.0
4   4.0  2.0  3.0
5   9.0  3.0  7.0
6   9.0  3.0  7.0
7   9.0  3.0  7.0
8   9.0  3.0  7.0
9   8.0  2.0  2.0
10  8.0  2.0  2.0
11  2.0  9.0  0.0
12  3.0  3.0  0.0
13  3.0  3.0  0.0

Upvotes: 1

Related Questions