Raven Cheuk

Reputation: 3053

Strange behavior when using lambda function on Pandas's groupby

I have a pandas DataFrame with two groups, 'A' and 'B', and one element is missing in each group.

import numpy as np
import pandas as pd

df4 = pd.DataFrame({'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                    'X': [0, 0.5, 1, np.nan, 1, np.nan, 1]})

Name    X
A       0.0
A       0.5
A       1.0
A       nan
B       1.0
B       nan
B       1.0

I would like to use a lambda function to fill in the missing data for each group.

Correct behavior when using x.mean()

df4.groupby('Name')['X'].transform(lambda x: x.fillna(x.mean()))
0    0.0
1    0.5
2    1.0
3    0.5 <------ Filled as 0.5
4    1.0
5    1.0 <------ Filled as 1
6    1.0

If I use x.mean() as shown above, the behavior is correct, since in group A, the mean is 1.5/3 which is 0.5. The same goes for group B.

Strange behavior when using x.std()

However, if I use x.std() instead, the filled number doesn't make sense to me. For group A, there are only three existing elements, 0, 0.5, and 1.0, and their standard deviation should be 0.408. Yet the lambda function gives me the following output.

df4.groupby('Name')['X'].transform(lambda x: x.fillna(x.std()))
0    0.0
1    0.5
2    1.0
3    0.5 <------ Filled as 0.5 instead of 0.4082
4    1.0
5    0.0 <------ Correct
6    1.0 

Can anyone explain this behavior? Where does that 0.5 come from?

Upvotes: 1

Views: 64

Answers (1)

jezrael

Reputation: 863291

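pandas.Series.std defaults to ddof=1, which computes the sample standard deviation (dividing by N - 1). For group A that is sqrt(0.5 / 2) = 0.5, which is where the 0.5 comes from. To get the population standard deviation (dividing by N), which is sqrt(0.5 / 3) = 0.408, pass ddof=0: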

print (df4.groupby('Name')['X'].transform(lambda x: x.fillna(x.std(ddof=0))))
0    0.000000
1    0.500000
2    1.000000
3    0.408248
4    1.000000
5    0.000000
6    1.000000
Name: X, dtype: float64
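
To see the two conventions side by side, here is a minimal sketch on group A's values alone. The variable names are only for illustration. It assumes the 0.408 expected in the question came from a divide-by-N calculation, which is also what numpy.std does by default.

```python
import numpy as np
import pandas as pd

# Group A's values from the question, including the missing entry.
group_a = pd.Series([0.0, 0.5, 1.0, np.nan])

# pandas skips NaN and uses ddof=1 by default (divide by N - 1).
sample_std = group_a.std()

# ddof=0 divides by N instead (population standard deviation).
population_std = group_a.std(ddof=0)

# numpy.std defaults to ddof=0, so it matches the population value
# once the NaN has been dropped.
numpy_std = np.std(group_a.dropna().to_numpy())

print(sample_std, population_std, numpy_std)
```

With three non-missing values, the sum of squared deviations from the mean is 0.5, so the two results differ only in whether that sum is divided by 2 or by 3 before taking the square root.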

Upvotes: 3
