Reputation: 71
I am trying to replace null values in a column with the average according to the group from another column. I have tried this code and the null values are replaced but not correctly. How so? How should I correct this?
the first two null values should be replaced by a 3 since they belong to group 'A' where the average is 3. the following null value should be 4 since is located in group B that has 4,2,1,5 averaging 3.
column 'z' should have the following values: 3 3 3 5 3 1 2 4 6 9 10 5
xx=float('nan')
data=[['A', 1, xx ],
['B', 5,5],
['C', 4,6]
,['A', 6,xx],
['B',9,xx],
['C', 7,9]
,['A', 2,3],
['B', 5,1],
['C',2,10]
,['B', 8,2],
['B', 5,4],
['C', 8,5 ]]
dff = pd.DataFrame(data, columns=['x','y','z'])
dff = dff.sort_values(by =['x'], ascending=True)
dff.reset_index(drop=True, inplace= True)
print(dff)
dff['z'] = df.groupby(['x'])['z'].transform(lambda x: x.fillna(x.mean()))
print(dff)
Upvotes: 1
Views: 1532
Reputation: 11395
Replacing missing values can be done with .fillna
and group averages with groupby(...).mean(...)
.
A useful trick is to use groupby.transform('mean')
which returns a series or dataframe of the original shape:
>>> dff.groupby('x')['z'].mean()
x
A 3.0
B 3.0
C 7.5
Name: z, dtype: float64
>>> dff.groupby('x')['z'].transform('mean')
0 3.0
1 3.0
2 3.0
3 3.0
4 3.0
5 3.0
6 3.0
7 3.0
8 7.5
9 7.5
10 7.5
11 7.5
Name: z, dtype: float64
Note that the average of group B is 3
not 4
as you expected.
So then it simply becomes:
>>> dff['z'].fillna(dff.groupby('x')['z'].transform('mean'))
0 3.0
1 3.0
2 3.0
3 5.0
4 3.0
5 1.0
6 2.0
7 4.0
8 6.0
9 9.0
10 10.0
11 5.0
Name: z, dtype: float64
Upvotes: 2