Reputation:
This was very difficult to phrase, so let me show you what I'm trying to accomplish.
df
Y X
a 10
a 5
a NaN
b 12
b 13
b NaN
c 5
c NaN
c 5
c 6
Y: 10 non-null object
X: 7 non-null float64
Take category 'a' in column Y: the median of its non-missing X values is (10 + 5) / 2 = 7.5, so the missing value for 'a' must be filled with 7.5.
Similarly, for category 'b', the median of the non-missing X values is (12 + 13) / 2 = 12.5.
For category 'c', the median of the non-missing X values (5, 5, 6) is 5, the middle value.
I used very long, repetitive code, as follows.
grouped = df.groupby(['Y'])[['X']]
grouped.agg([np.median])
X
median
Y
a 7.5
b 12.5
c 5
df.X = df.X.fillna(-1)
df.loc[(df['Y'] == 'a') & (df['X'] == -1), 'X'] = 7.5
df.loc[(df['Y'] == 'b') & (df['X'] == -1), 'X'] = 12.5
df.loc[(df['Y'] == 'c') & (df['X'] == -1), 'X'] = 5
I was told that there is not only repetition but also the use of magic numbers, which should be avoided.
I want to write a function that does this filling efficiently.
Upvotes: 1
Views: 1362
Reputation: 294228
Use groupby and transform.
The transform looks like this:
df.groupby('Y').X.transform('median')
0 7.5
1 7.5
2 7.5
3 12.5
4 12.5
5 12.5
6 5.0
7 5.0
8 5.0
9 5.0
Name: X, dtype: float64
This result has the same index as the original dataframe, so we can pass it straight to fillna:
df.X.fillna(df.groupby('Y').X.transform('median'))
0 10.0
1 5.0
2 7.5
3 12.0
4 13.0
5 12.5
6 5.0
7 5.0
8 5.0
9 6.0
Name: X, dtype: float64
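To try the steps above end to end, here is a minimal, self-contained sketch. It assumes the frame from the question can be rebuilt as shown (the construction of df is my assumption, since the question only prints it), and it checks that the transformed medians share the original index before filling.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (assumed construction).
df = pd.DataFrame({
    "Y": list("aaabbbcccc"),
    "X": [10, 5, np.nan, 12, 13, np.nan, 5, np.nan, 5, 6],
})

# transform returns one value per original row, so its index matches df.
group_median = df.groupby("Y")["X"].transform("median")
assert group_median.index.equals(df.index)

# fillna aligns on that shared index and only replaces the missing rows.
filled = df["X"].fillna(group_median)
print(filled.tolist())
```

Because fillna aligns on the index, the rows that already had a value are left untouched; only the three NaN rows pick up their group's median.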
You can either make a new copy of the dataframe
df.assign(X=df.X.fillna(df.groupby('Y').X.transform('median')))
Y X
0 a 10.0
1 a 5.0
2 a 7.5
3 b 12.0
4 b 13.0
5 b 12.5
6 c 5.0
7 c 5.0
8 c 5.0
9 c 6.0
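Or fill the values in place by assigning the result back to the column (calling fillna with inplace=True on df.X is a chained operation and may not update df in recent pandas versions)
df['X'] = df['X'].fillna(df.groupby('Y')['X'].transform('median'))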
df
Y X
0 a 10.0
1 a 5.0
2 a 7.5
3 b 12.0
4 b 13.0
5 b 12.5
6 c 5.0
7 c 5.0
8 c 5.0
9 c 6.0
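Since the question asks for a function, the same idea could be wrapped up roughly like this. The name fill_with_group_median and its parameters are hypothetical, chosen only for illustration; the sketch returns a new dataframe rather than modifying its input.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(frame, group_col, value_col):
    """Return a copy of frame where NaNs in value_col are replaced by the
    median of value_col within each group_col category.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # One median per row, aligned with the index of the input frame.
    medians = frame.groupby(group_col)[value_col].transform("median")
    # assign builds a new frame, so the caller's frame is left unchanged.
    return frame.assign(**{value_col: frame[value_col].fillna(medians)})


# Example frame from the question (assumed construction).
df = pd.DataFrame({
    "Y": list("aaabbbcccc"),
    "X": [10, 5, np.nan, 12, 13, np.nan, 5, np.nan, 5, 6],
})

result = fill_with_group_median(df, "Y", "X")
print(result)
```

No category names or median values are hard-coded, which avoids both the repetition and the magic numbers. Note that if every value in a group is missing, that group's median is itself NaN, so those rows stay unfilled.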
Upvotes: 1