Alex J

Reputation: 111

Pandas groupby agg std NaN

Inputs:

(df['PopEst']
    .astype('float')
    .groupby(ContinentDict)
    .agg(['size', 'sum', 'mean', 'std']))

Outputs:

            size            sum                mean              std
Asia          5     2.898666e+09       5.797333e+08     6.790979e+08
Australia     1     2.331602e+07       2.331602e+07              NaN
Europe        6     4.579297e+08       7.632161e+07     3.464767e+07
North America 2     3.528552e+08       1.764276e+08     1.996696e+08
South America 1     2.059153e+08       2.059153e+08              NaN

Some values in the std column turn out to be NaN when a group has only one row, but I think these values should be 0. Why is that?

Upvotes: 11

Views: 13985

Answers (2)

etudiant

Reputation: 121

According to the documentation, pandas computes std with ddof=1 ("delta degrees of freedom") by default, whereas np.std defaults to ddof=0. To fix your problem, replace 'std' with lambda x: np.std(x, ddof=0); the NaN values will then become 0.
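A minimal sketch of the ddof=0 approach, assuming small made-up data in place of the question's df['PopEst'] and ContinentDict (the names pop, continent and pop_std are hypothetical). A named function is used instead of a lambda only so that the output column gets a readable label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df['PopEst'] and ContinentDict.
pop = pd.Series(
    [1.0, 3.0, 10.0],
    index=["China", "Japan", "Australia"],
    name="PopEst",
)
continent = {"China": "Asia", "Japan": "Asia", "Australia": "Australia"}

def pop_std(x):
    # Population standard deviation: ddof=0 divides by n instead of n - 1.
    return np.std(x, ddof=0)

# Group the index labels through the dict, then aggregate each group.
res = pop.groupby(continent).agg(["size", "mean", pop_std])
print(res)
```

With this data the single-row Australia group should get 0 rather than NaN in the pop_std column.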

Upvotes: 0

jpp

Reputation: 164783

pd.DataFrame.std uses 1 delta degree of freedom (ddof=1) by default, which gives the sample standard deviation. The sum of squared deviations is divided by n - 1, so a group with a single value yields NaN.

numpy.std, by contrast, uses ddof=0 by default, which gives the population standard deviation. The divisor is n, so a group with a single value yields 0.

To understand the difference between sample and population, see Bessel's correction.
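As a small illustration of the two conventions, the sketch below compares them on a one-element and a two-element series. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

single = pd.Series([5.0])
pair = pd.Series([6.0, 7.0])

# Sample standard deviation (pandas default, ddof=1): divisor is n - 1.
sample_single = single.std()            # n - 1 = 0, so the result is NaN
sample_pair = pair.std()                # sqrt(0.5 / 1)

# Population standard deviation (ddof=0): divisor is n.
population_single = single.std(ddof=0)  # 0 / 1, so the result is 0.0
population_pair = np.std(pair.to_numpy())  # numpy default is ddof=0: sqrt(0.5 / 2)

print(sample_single, population_single)
print(sample_pair, population_pair)
```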

Therefore, you can specify numpy.std for your calculation. Note, however, that the results will also differ for groups with more than one row, because the calculation is different. Here's a minimal example.

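import numpy as np
import pandas as pd

# Random integers, so the groups and values vary between runs.
df = pd.DataFrame(np.random.randint(0, 9, (5, 2)))

# Wrapper around numpy.std, which defaults to ddof=0 (population standard deviation).
def std(x):
    return np.std(x)

# Group by column 0 and aggregate column 1; the wrapper's name labels the output column.
res = df.groupby(0)[1].agg(['size', 'sum', 'mean', std])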

print(res)

   size  sum  mean       std
0                           
0     2   13   6.5       0.5
4     1    3   3.0       0.0
5     1    3   3.0       0.0
6     1    3   3.0       0.0

Alternatively, if you need the sample standard deviation (1 degree of freedom), you can use fillna to replace NaN values with 0:

res = df.groupby(0)[1].agg(['size', 'sum', 'mean', 'std']).fillna(0)
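Calling fillna(0) on the whole result replaces NaN in every column, not only in std. If that matters, one option is to fill the std column alone. This is a sketch with fixed, made-up data so the result is reproducible:

```python
import pandas as pd

# Fixed toy data: group 0 has two rows, group 4 has one row.
df = pd.DataFrame({0: [0, 0, 4], 1: [6, 7, 3]})

res = df.groupby(0)[1].agg(["size", "sum", "mean", "std"])

# Replace NaN only in the std column; other columns are left untouched.
res["std"] = res["std"].fillna(0)
print(res)
```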

Upvotes: 14
