Peaceful

Reputation: 5450

NaN values in the groupby statistics in pandas

I am working with a pandas DataFrame Top15 that contains population data for 15 countries:

                      Population
Country                         
China               1.367645e+09
United States       3.176154e+08
Japan               1.274094e+08
United Kingdom      6.387097e+07
Russian Federation  1.435000e+08
Canada              3.523986e+07
Germany             8.036970e+07
India               1.276731e+09
France              6.383735e+07
South Korea         4.980543e+07
Italy               5.990826e+07
Spain               4.644340e+07
Iran                7.707563e+07
Australia           2.331602e+07
Brazil              2.059153e+08

Now I want to see the continent-wise statistics for these data. So I created a groupby object using a dictionary:

df = Top15.groupby(ContinentDict)

where:

ContinentDict = {'China': 'Asia',
                 'United States': 'North America',
                 'Japan': 'Asia',
                 'United Kingdom': 'Europe',
                 'Russian Federation': 'Europe',
                 'Canada': 'North America',
                 'Germany': 'Europe',
                 'India': 'Asia',
                 'France': 'Europe',
                 'South Korea': 'Asia',
                 'Italy': 'Europe',
                 'Spain': 'Europe',
                 'Iran': 'Asia',
                 'Australia': 'Australia',
                 'Brazil': 'South America'}

Then I create a new DataFrame containing the various statistics:

new_df = pd.DataFrame({'size': df.size().values,
                       'sum': df.sum().values,
                       'mean': df.mean().values,
                       'std': df.std().values},
                      index=df.groups.keys())

I get the following output:

                       mean  size           std           sum
North America  5.797333e+08     5  6.790979e+08  2.898666e+09
Asia           2.331602e+07     1           NaN  2.331602e+07
South America  7.632161e+07     6  3.464767e+07  4.579297e+08
Europe         1.764276e+08     2  1.996696e+08  3.528552e+08
Australia      2.059153e+08     1           NaN  2.059153e+08

As you can see, there are two NaN values in the standard deviation column (for Asia and Australia).

After this, I tried looking at the individual values

df.std()

and I get:

Asia             6.790979e+08
Australia                 NaN
Europe           3.464767e+07
North America    1.996696e+08
South America             NaN
Name: Population, dtype: float64

Now Asia is completely fine but South America is not, even though there are no NaN values in my original DataFrame. How does one explain this strange behavior, and how can it be fixed?

Upvotes: 1

Views: 1615

Answers (1)

BrenBarn

Reputation: 251388

That is not a good way to get groupby statistics. Just compute the statistics directly on the grouped object by passing a list of function names to agg:

>>> Top15.groupby(ContinentDict).Population.agg(['size', 'mean', 'std', 'sum'])
               size          mean           std           sum
Asia              5  5.797333e+08  6.790979e+08  2.898666e+09
Australia         1  2.331602e+07           NaN  2.331602e+07
Europe            6  7.632161e+07  3.464767e+07  4.579297e+08
North America     2  1.764276e+08  1.996697e+08  3.528553e+08
South America     1  2.059153e+08           NaN  2.059153e+08

(You can use strings because all the functions you're using are built-in pandas methods and so are special-cased. If you wanted to compute a custom function, you'd pass the actual function object.)
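For instance, here is a minimal sketch (with made-up numbers, not the Top15 data) mixing a string-named built-in with a custom statistic, the range, passed as a function object:

```python
import pandas as pd

# Toy data standing in for the question's Population column.
pop = pd.Series([1.37e9, 4.98e7, 8.04e7],
                index=['China', 'South Korea', 'Germany'],
                name='Population')
continents = {'China': 'Asia', 'South Korea': 'Asia', 'Germany': 'Europe'}

# 'mean' is a built-in name; the range (max - min) is a custom
# function, so it is passed as the function object itself.
stats = pop.groupby(continents).agg(['mean', lambda s: s.max() - s.min()])
print(stats)
```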

As for the NaNs, those occur where there was only one country on a given continent. The sample standard deviation of a single number is undefined, and pandas uses the sample standard deviation by default. (You can get the population stdev by calling .std(ddof=0), which will give you zero for these cases.)
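A quick illustration of the ddof difference on a single-element group (toy numbers, not from the question):

```python
import pandas as pd

single = pd.Series([42.0])

# ddof=1 (sample std, the default): divides by n - 1 = 0, giving NaN.
print(single.std())
# ddof=0 (population std): defined even for one value, giving 0.0.
print(single.std(ddof=0))
```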

The reason you were seeing the NaNs in different places before is that you explicitly passed .groups.keys() as an index. .groups is just a dictionary, so its .keys() may come back in an arbitrary order. What happened is that the results you got from computing mean, std, etc. were in a different order than the keys you got from the dict. There's no need to compute the various summary statistics separately as you were doing; you can do them all at once with .agg, and pandas will make sure everything is aligned for you.
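If you did want to assemble the frame by hand, the misalignment disappears as soon as you keep the grouped results as Series (dropping the .values and the explicit index), because the DataFrame constructor then aligns the columns by group label. A minimal sketch with toy data:

```python
import pandas as pd

pop = pd.DataFrame({'Population': [10.0, 20.0, 30.0, 40.0]},
                   index=['a', 'b', 'c', 'd'])
mapping = {'a': 'G1', 'b': 'G1', 'c': 'G2', 'd': 'G2'}
g = pop.groupby(mapping)['Population']

# Each statistic keeps its group index, so pandas aligns the columns
# by label instead of relying on any particular ordering.
stats = pd.DataFrame({'size': g.size(), 'mean': g.mean(),
                      'std': g.std(), 'sum': g.sum()})
print(stats)
```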

Upvotes: 3
