Reputation: 345
I have a column named "age" with a few NaN; crude logic of deriving the value of the age is finding the mean of age using 2 key categorical variables - job, gender
df = pd.DataFrame([[1,2,1,2,3,4,11,12,13,12,11,1,10], [19,23,np.nan,29,np.nan,32,27,48,39,70,29,51,np.nan],
['a','b','c','d','e','a','b','c','d','e','a','b','c'],['M','F','M','F','M','F','M','F','M','M','F','F','F']]).T
df.columns = ['col1','age','job','gender']
df = df.astype({"col1": int, "age": float})
df['job'] = df.job.astype('category')
df['gender'] = df.gender.astype('category')
df
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 NaN e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 NaN c M
df.groupby(['job','gender']).mean().reset_index()
job gender col1 age
0 a F 7.500000 30.5
1 a M 1.000000 19.0
2 b F 1.500000 37.0
3 b M 11.000000 27.0
4 c F NaN NaN
5 c M 7.666667 48.0
6 d F 7.500000 34.0
7 d M NaN NaN
8 e F NaN NaN
9 e M 7.500000 70.0
I want to update the age to the derived value from above. What is the optimal way of doing it? Should I store it in another dataframe and loop it through for updation?
Resultant output should look like this:
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 70.0 c M
Thanks.
Upvotes: 2
Views: 100
Reputation: 862481
Use Series.fillna
with GroupBy.transform
, but because in sample data are not data for combination c, M
there is NaN
:
df['age'] = df['age'].fillna(df.groupby(['job','gender'])['age'].transform('mean'))
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c F
If need also replace NaN
by groiping only by id
add another fillna
:
avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')
df['age'] = df['age'].fillna(avg1).fillna(avg2)
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c F
Upvotes: 2