stoopido
stoopido

Reputation: 33

How to fill NAs with median of means of 2-column groupby in pandas?

Working with pandas, I have a dataframe with two hierarchies A and B, where B can be NaN, and I want to fill some NaNs in D in a particular way:

In the example below, A has "B-subgroups" where there are no values at all for D (e.g. (1, 1)), while A also has values for D in other subgroups (e.g. (1, 3)).

Now I want to get the mean of each subgroup (120, 90 and 75 for A==1), find the median of these means (90 for A==1) and use this median to fill NaNs in the other subgroups of A==1.

Groups like A==2, where there are only NaNs for D, should not be filled.

Groups like A==3, where there are some values for D but only rows with B being NaN have NaN in D, should not be filled if possible (I intend to fill these later with the mean of all values of D of their whole A groups).

Example df:

d = {'A': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3], 
    'B': [1, 2, 3, 3, 4, 5, 6, 1, 1, np.NaN, np.NaN], 
    'D': [np.NaN, np.NaN, 120, 120, 90, 75, np.NaN, np.NaN, 60, 50, np.NaN]}
df = pd.DataFrame(data=d)
A       B       D
1       1       NaN
1       2       NaN
1       3       120
1       3       120
1       4       90
1       5       75
1       6       NaN
2       1       NaN
3       1       60
3       NaN     50
3       NaN     NaN

Expected result:

A       B       D
1       1       90
1       2       90
1       3       120
1       3       120
1       4       90
1       5       75
1       6       90
2       1       NaN
3       1       60
3       NaN     50
3       NaN     NaN

With df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median') or .median() I seem to get the right values, but using

df['D'] = df['D'].fillna(
    df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
)

does not seem to change any values in D.

Any help is greatly appreciated, I've been stuck on this for a while and cannot find any solution anywhere.

Upvotes: 3

Views: 286

Answers (1)

Erfan
Erfan

Reputation: 42886

Your first step is correct. After that we use Series.map to map the correct medians to each group in column A.

Finally we use np.where to conditionally fill in column D if B is not NaN:

medians = df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
df['D'] = np.where(df['B'].notna(),                        # if B is not NaN
                   df['D'].fillna(df['A'].map(medians)),   # fill in the median
                   df['D'])                                # else keep the value of column D
    A    B      D
0   1 1.00  90.00
1   1 2.00  90.00
2   1 3.00 120.00
3   1 3.00 120.00
4   1 4.00  90.00
5   1 5.00  75.00
6   1 6.00  90.00
7   2 1.00    nan
8   3 1.00  60.00
9   3  nan  50.00
10  3  nan    nan

Upvotes: 3

Related Questions