Reputation: 5450
I am working with a pandas DataFrame Top15
that contains a population data of 15 countries in the world.
Population
Country
China 1.367645e+09
United States 3.176154e+08
Japan 1.274094e+08
United Kingdom 6.387097e+07
Russian Federation 1.435000e+08
Canada 3.523986e+07
Germany 8.036970e+07
India 1.276731e+09
France 6.383735e+07
South Korea 4.980543e+07
Italy 5.990826e+07
Spain 4.644340e+07
Iran 7.707563e+07
Australia 2.331602e+07
Brazil 2.059153e+08
Now I want to see the continent-wise statistics for these data. So I created a groupby object using a dictionary:
df = Top15.groupby(ContinentDict)
where:
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
and then I am creating a new dataframe that will contain the various statistics:
new_df = pd.DataFrame({'size' : df.size().values, 'sum' : df.sum().values, 'mean' : df.mean().values, 'std' : df.std().values}, index = df.groups.keys())
I get the following output:
mean size std sum
North America 5.797333e+08 5 6.790979e+08 2.898666e+09
Asia 2.331602e+07 1 NaN 2.331602e+07
South America 7.632161e+07 6 3.464767e+07 4.579297e+08
Europe 1.764276e+08 2 1.996696e+08 3.528552e+08
Australia 2.059153e+08 1 NaN 2.059153e+08
As you can see, in the standard deviation column, two NaN
values (For Asia and Australia).
After this, I tried looking at the individual values
df.std()
and I get:
Asia 6.790979e+08
Australia NaN
Europe 3.464767e+07
North America 1.996696e+08
South America NaN
Name: Population, dtype: float64
Now Asia is completely fine and South America is not! I do not have any NaN
values in my original dataframe. How does one explain this strange behavior and how to fix it?
Upvotes: 1
Views: 1615
Reputation: 251388
That is not a good way to get groupby statistics. Just compute the statistics directly on the grouped object by passing a list of function names to agg
:
>>> d.groupby(ContinentDict).Population.agg(['size', 'mean', 'std', 'sum'])
size mean std sum
Asia 5 5.797333e+08 6.790979e+08 2.898666e+09
Australia 1 2.331602e+07 NaN 2.331602e+07
Europe 6 7.632161e+07 3.464767e+07 4.579297e+08
North America 2 1.764276e+08 1.996697e+08 3.528553e+08
South America 1 2.059153e+08 NaN 2.059153e+08
(You can use strings because all the functions you're using are builtin pandas methods and so are special cased. If you wanted to compute any custom function, you'd pass the actual function object.)
As for the NaNs, those occur where there was only one country on a given continent. The sample standard deviation of a single number is undefined, and pandas uses the sample standard deviation by default. (You can get the population stdev by calling .std(ddof=0)
, which will give you zero for these cases.)
The reason you were seeing the NaNs in different places before is because you explicitly passed .groups.keys()
as an index. .groups
is just a dictionary, so its .key()
may be in any arbitrary order. What happened is that the results you got from computing mean, std, etc. were in a different order than the keys you got from the dict. There's no need to compute the various summary statistics separately as you were doing; you can do them all at once with .agg
and pandas will make sure everything is aligned for you.
Upvotes: 3