Confusing results of standard deviation computing in groupby object with pandas

Question

Let's say the dataframe has the columns name, category, and rank, where name identifies an individual, category is a categorical variable, and rank is the individual's rank in that row.

First, I computed the mean rank for each name and category:

X = df.groupby(['name','category'])['rank'].agg('mean')
#out:
+---------+-------------------+------+
|  name   | category          |      |
+---------+-------------------+------+
| 1260229 |                 9 | 11.0 |
|         |                18 | 9.50 |
| 1126191 |                 5 | 4.00 |
|         |                17 | 3.00 |
|         |                23 | 4.00 |
| 1065670 |                33 | 3.00 |
|         |                39 | 5.00 |
|         |                41 | 8.00 |
+---------+-------------------+------+

Next, I computed the standard deviation of those means for each name:

X.reset_index().groupby('name')['rank'].agg(np.std)
#out:
+---------+------+
|  name   |      |
+---------+------+
| 1260229 | 1.06 |
| 1126191 | 0.58 |
| 1065670 | 2.51 |
+---------+------+
#Note here that "rank" is actually the mean of rank by category. I just didn't change the name
#of the column for the new dataframe issued from X.reset_index()

The problem is that when I compute it by hand for individual 1260229 with np.std([11, 9.50]), I get 0.75 and not 1.06. The same discrepancy appears for the other individuals.

I don't understand which step is wrong and produces these results.

Pandas version: 0.23.4, Python version: 3.7.4

jezrael · Accepted Answer

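pandas and NumPy use different defaults: DataFrame.std (and std on a groupby) uses ddof=1, the sample standard deviation, whereas numpy.std uses ddof=0, the population standard deviation. In the pandas version used here, passing np.std to agg on a groupby does not change this, because pandas substitutes its own std, which is why you get 1.06 rather than 0.75.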
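
To make the difference concrete, here is a minimal sketch using the two mean ranks of individual 1260229 from the question. The variable names are illustrative only; the calls are plain numpy.std and Series.std.

```python
import numpy as np
import pandas as pd

# Mean ranks of individual 1260229, taken from the question's first table.
values = [11.0, 9.5]

# NumPy default: ddof=0, so the sum of squared deviations is divided by N.
population_std = np.std(values)

# With ddof=1 the divisor becomes N - 1 (sample standard deviation).
sample_std = np.std(values, ddof=1)

# pandas default: ddof=1, so this matches sample_std, not population_std.
pandas_std = pd.Series(values).std()

print(population_std, sample_std, pandas_std)
```

With only two values the divisor changes from 2 to 1, so the gap between the two conventions is at its largest here and shrinks as the group size grows.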

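To simplify, you can replace the second groupby with std and the level=0 parameter. (The level argument of std was deprecated in pandas 1.3 and removed in 2.0; on newer versions use the groupby form at the end of this answer.)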

s = X.std(level=0)
print (s)
name
1260229    1.060660
1126191    0.577350
1065670    2.516611
Name: rank, dtype: float64

Passing ddof=1 explicitly gives the same result:

s = X.std(level=0, ddof=1)
print (s)
name
1260229    1.060660
1126191    0.577350
1065670    2.516611
Name: rank, dtype: float64

With ddof=0 you get the same values as numpy.std:

s = X.std(level=0, ddof=0)
print (s)
name
1260229    0.750000
1126191    0.471405
1065670    2.054805
Name: rank, dtype: float64

If you prefer to keep using groupby, that works too:

s = X.groupby(level=0, sort=False).std(ddof=0)
print (s)
name
1260229    0.750000
1126191    0.471405
1065670    2.054805
Name: rank, dtype: float64
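
For anyone who wants to reproduce the numbers without the original dataframe, the following is a self-contained sketch. It rebuilds X by hand from the means shown in the question (an assumption, since the raw data is not available) and compares the two conventions. The lambda variant is just one way to force NumPy's own std inside agg.

```python
import numpy as np
import pandas as pd

# Mean ranks per (name, category), copied from the question's first table.
index = pd.MultiIndex.from_tuples(
    [(1260229, 9), (1260229, 18),
     (1126191, 5), (1126191, 17), (1126191, 23),
     (1065670, 33), (1065670, 39), (1065670, 41)],
    names=["name", "category"],
)
X = pd.Series([11.0, 9.5, 4.0, 3.0, 4.0, 3.0, 5.0, 8.0],
              index=index, name="rank")

# Group on the first index level; sort=False keeps the original name order.
grouped = X.groupby(level="name", sort=False)

sample = grouped.std()            # pandas default, ddof=1
population = grouped.std(ddof=0)  # same convention as numpy.std

# Calling numpy.std on the raw array inside a lambda also uses ddof=0.
via_numpy = grouped.agg(lambda g: np.std(g.to_numpy()))

print(sample)
print(population)
```

The sample column matches the 1.06 / 0.58 / 2.51 values from the question, and the population column matches the hand computation with np.std.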
