Reputation: 3053
I have a pandas DataFrame with two groups, 'A' and 'B', and one element is missing in each group.
import numpy as np
import pandas as pd
df4 = pd.DataFrame({'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                    'X': [0, 0.5, 1, np.nan, 1, np.nan, 1]})
Name X
A 0.0
A 0.5
A 1.0
A nan
B 1.0
B nan
B 1.0
I would like to use a lambda function to fill in the missing data for each group.
x.mean()
df4.groupby('Name')['X'].transform(lambda x: x.fillna(x.mean()))
0 0.0
1 0.5
2 1.0
3 0.5 <------ Filled as 0.5
4 1.0
5 1.0 <------ Filled as 1
6 1.0
If I use x.mean() as shown above, the behavior is correct, since in group A the mean is 1.5/3, which is 0.5. The same goes for group B.
x.std()
However, if I use x.std() instead, the filled number doesn't make sense to me. For group A, there are only three existing elements, 0, 0.5, and 1.0, and their standard deviation should be 0.408. Yet the lambda function gives me the following output.
df4.groupby('Name')['X'].transform(lambda x: x.fillna(x.std()))
0 0.0
1 0.5
2 1.0
3 0.5 <------ Filled as 0.5 instead of 0.4082
4 1.0
5 0.0 <------ Correct
6 1.0
Can anyone explain this behavior? Where does that 0.5 come from?
Upvotes: 1
Views: 64
Reputation: 863291
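You need to change the default parameter of pandas.Series.std from ddof=1 to ddof=0. With the default ddof=1, pandas computes the sample standard deviation, dividing by N - 1, which gives 0.5 for group A; ddof=0 divides by N and gives the 0.408 you expected: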
print (df4.groupby('Name')['X'].transform(lambda x: x.fillna(x.std(ddof=0))))
0 0.000000
1 0.500000
2 1.000000
3 0.408248
4 1.000000
5 0.000000
6 1.000000
Name: X, dtype: float64
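To see where the 0.5 comes from, here is a minimal sketch that recomputes both values for group A by hand and compares them with pandas. The variable names are my own and only for illustration.

```python
import numpy as np
import pandas as pd

# The three non-missing values of group A from the question.
values = np.array([0.0, 0.5, 1.0])
n = len(values)

# Sum of squared deviations from the mean (the mean is 0.5).
squared_dev = ((values - values.mean()) ** 2).sum()

# ddof=1 divides by N - 1 (sample standard deviation, the pandas default).
sample_std = np.sqrt(squared_dev / (n - 1))

# ddof=0 divides by N (population standard deviation, the NumPy default).
population_std = np.sqrt(squared_dev / n)

# pandas skips NaN by default, so the missing entry does not change N.
group_a = pd.Series([0.0, 0.5, 1.0, np.nan])
pandas_default = group_a.std()        # same as ddof=1
pandas_ddof0 = group_a.std(ddof=0)

print(sample_std, pandas_default)     # both 0.5
print(population_std, pandas_ddof0)   # both approximately 0.408
```

So the 0.5 is the sample standard deviation, and 0.408 is what np.std would return, because NumPy defaults to ddof=0 while pandas defaults to ddof=1.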
Upvotes: 3