Reputation: 111
Inputs:
(df['PopEst']
    .astype('float')
    .groupby(ContinentDict)
    .agg(['size', 'sum', 'mean', 'std']))
Outputs:
               size           sum          mean           std
Asia              5  2.898666e+09  5.797333e+08  6.790979e+08
Australia         1  2.331602e+07  2.331602e+07           NaN
Europe            6  4.579297e+08  7.632161e+07  3.464767e+07
North America     2  3.528552e+08  1.764276e+08  1.996696e+08
South America     1  2.059153e+08  2.059153e+08           NaN
Some values in the std column turn out to be NaN when a group has just one row, but I think these values should be 0. Why is that?
Upvotes: 11
Views: 13985
Reputation: 121
According to the documentation, pandas' std sets the "delta degrees of freedom" (ddof) to 1 by default, whereas np.std defaults to ddof=0. To fix your problem, replace 'std' in the agg list with lambda x: np.std(x, ddof=0); your NaN values will then become 0.
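As a minimal sketch of this suggestion, the snippet below aggregates a small made-up frame with a population standard deviation. The column names, the data, and the helper name pop_std are assumptions for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: "Australia" has a single row.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Australia"],
    "PopEst": [1.0, 3.0, 5.0],
})

def pop_std(x):
    # Population standard deviation (ddof=0): divides by n rather than n - 1.
    return np.std(x, ddof=0)

# The function's name becomes the label of the resulting column.
res = df.groupby("continent")["PopEst"].agg(["size", "sum", "mean", pop_std])
print(res)
```

A named function is used here instead of a lambda only so that the output column gets a readable label; a lambda should behave the same way apart from the column name.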
Upvotes: 0
Reputation: 164783
pd.DataFrame.std uses 1 delta degree of freedom (ddof=1) by default, which gives the sample standard deviation. This yields NaN for groups with a single value.
numpy.std, by contrast, uses ddof=0 by default, which gives the population standard deviation. This yields 0 for groups with a single value.
To understand the difference between sample and population, see Bessel's correction.
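To make the distinction concrete, here is a small sketch that computes both versions by hand and compares them with the library calls. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 6.0])  # illustrative data only
n = len(values)
squared_dev = ((values - values.mean()) ** 2).sum()

population_std = np.sqrt(squared_dev / n)    # ddof=0: divide by n
sample_std = np.sqrt(squared_dev / (n - 1))  # ddof=1: divide by n - 1

# The hand-computed values agree with the library defaults.
assert np.isclose(population_std, np.std(values))          # NumPy default, ddof=0
assert np.isclose(sample_std, pd.Series(values).std())     # pandas default, ddof=1

# With a single value, n - 1 is 0, so the sample version is undefined (NaN),
# while the population version is 0.
single = pd.Series([5.0])
single_sample_std = single.std()            # NaN
single_population_std = single.std(ddof=0)  # 0.0
```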
Therefore, you can specify numpy.std for your calculation. Note, however, that the output will differ because the calculation is different. Here's a minimal example.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 9, (5, 2)))
# Wrap np.std so NumPy's default ddof=0 applies; the function name becomes the column label.
def std(x):
    return np.std(x)
res = df.groupby(0)[1].agg(['size', 'sum', 'mean', std])
print(res)
   size  sum  mean  std
0
0     2   13   6.5  0.5
4     1    3   3.0  0.0
5     1    3   3.0  0.0
6     1    3   3.0  0.0
Alternatively, if you require 1 degree of freedom, you can use fillna to replace NaN values with 0:
res = df.groupby(0)[1].agg(['size', 'sum', 'mean', 'std']).fillna(0)
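One possible caveat is that a blanket fillna(0) also replaces NaN in the other columns, for example a mean that is NaN because a group contains only missing values. A hedged variant, using made-up data and column names, restricts the fill to the std column by passing a dict to fillna:

```python
import numpy as np
import pandas as pd

# Illustrative data: group "b" has one row, group "c" has only a missing value.
df = pd.DataFrame({
    "key": ["a", "a", "b", "c"],
    "val": [1.0, 3.0, 5.0, np.nan],
})

# Fill NaN only in the "std" column; NaN elsewhere is left untouched.
res = (df.groupby("key")["val"]
         .agg(["size", "sum", "mean", "std"])
         .fillna({"std": 0}))
print(res)
```

Whether a filled 0 is appropriate for a group with no valid values at all depends on the analysis, so this is a sketch rather than a recommendation.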
Upvotes: 14