Reputation: 1852
I'm looking to fill in missing values of one column with the mode of the value from another column. Let's say this is our data set (borrowed from Chris Albon):
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jake', 'Jake', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze'],
'age': [42, np.nan, 36, 24, 73],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'preTestScore': [4, np.nan, np.nan, 2, 3],
'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
I know we can fill in missing postTestScore with each sex's mean value of postTestScore with:
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))
df
But how would we fill in missing sex with each first name's mode value of sex? (Obviously this is not politically correct, but as an example this was an easy data set to use.) So for this example the missing sex value would be 'm' because there are two Jakes with the value 'm'. If there were also a Jake with value 'f', it would still pick 'm' as the mode because 2 > 1. It would be nice if you could do:
df["sex"].fillna(df.groupby("first_name")["sex"].transform("mode"), inplace=True)
df
I looked into value_counts and apply but couldn't find this specific case. My ultimate goal is to be able to look at one column and, if that doesn't yield a mode value, to look at another column for a mode value.
Upvotes: 0
Views: 756
Reputation: 323226
You need to pass the mode function itself, pd.Series.mode, to transform:
df.groupby("first_name")["sex"].transform(pd.Series.mode)
Out[432]:
0 m
1 m
2 f
3 m
4 f
Name: sex, dtype: object
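To actually fill the gaps, and to cover the fallback to a second column mentioned in the question, here is a minimal sketch. It assumes that ties should resolve to the first value returned by mode (which is sorted) and that a group with no non-missing values should stay missing. The helper names first_mode and fill_with_group_mode are hypothetical, not part of pandas. Returning a scalar per group also avoids depending on how transform handles a Series result, which may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Jake", "Jake", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Smith", "Ali", "Milner", "Cooze"],
    "sex": ["m", np.nan, "f", "m", "f"],
})

def first_mode(s):
    # Hypothetical helper: return one mode per group as a scalar.
    # Series.mode() ignores NaN by default and may return several values on ties.
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

def fill_with_group_mode(frame, target, by):
    # Hypothetical helper: fill NaN in `target` with the mode of `target`
    # within each group of `by`; non-missing values are left untouched.
    group_mode = frame.groupby(by)[target].transform(first_mode)
    return frame[target].fillna(group_mode)

# First try the mode per first name, then fall back to the mode per last name.
df["sex"] = fill_with_group_mode(df, "sex", "first_name")
df["sex"] = fill_with_group_mode(df, "sex", "last_name")
print(df["sex"].tolist())
```

In this data set the first pass already fills the missing value for the second Jake, so the last_name pass changes nothing; it only matters when a first-name group has no known value at all.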
Upvotes: 1