Reputation: 970
Let's say the dataframe has the columns name, category, and rank, where name is the name of an individual, category is a categorical variable, and rank is the rank of the individual in one row.
First I wanted the mean for each name and category:
X = df.groupby(['name','category'])['rank'].agg('mean')
#out:
+---------+-------------------+------+
| name | category | |
+---------+-------------------+------+
| 1260229 | 9 | 11.0 |
| | 18 | 9.50 |
| 1126191 | 5 | 4.00 |
| | 17 | 3.00 |
| | 23 | 4.00 |
| 1065670 | 33 | 3.00 |
| | 39 | 5.00 |
| | 41 | 8.00 |
+---------+-------------------+------+
Now the standard deviation,
X.reset_index().groupby('name')['rank'].agg(np.std)
#out:
+---------+------+
| name | |
+---------+------+
| 1260229 | 1.06 |
| 1126191 | 0.58 |
| 1065670 | 2.51 |
+---------+------+
#Note here that "rank" is actually the mean of rank by category. I just didn't change the name
#of the column for the new dataframe issued from X.reset_index()
The problem is that when I compute it by hand (for the individual 1260229) as np.std([11, 9.50]), it returns 0.75 and not 1.06. The same issue occurs for the other individuals.
I don't see which manipulation is producing these wrong results.
Pandas version: 0.23.4 Python version: 3.7.4
Upvotes: 1
Views: 114
Reputation: 862406
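In pandas, DataFrame.std (and Series.std) defaults to ddof=1, while numpy.std defaults to ddof=0. Passing np.std to groupby(...).agg still gives the pandas result here, which is why you get 1.06 rather than 0.75.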
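To make the difference concrete, here is a minimal sketch using the two per-category means of individual 1260229 from the question. The variable names are only illustrative. With ddof=0 the sum of squared deviations is divided by N, with ddof=1 it is divided by N - 1.

```python
import numpy as np
import pandas as pd

# Per-category means for individual 1260229, taken from the question.
values = [11.0, 9.5]

population_std = np.std(values)        # NumPy default: ddof=0, divides by N
sample_std = np.std(values, ddof=1)    # divides by N - 1 instead
pandas_std = pd.Series(values).std()   # pandas default: ddof=1

# Roughly 0.75, 1.0607 and 1.0607.
print(population_std, sample_std, pandas_std)
```

So np.std(values, ddof=1) should reproduce the pandas number, and passing ddof=0 to the pandas method should reproduce the NumPy one.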
Instead of the second groupby, you can call std directly with the level=0 parameter to simplify the solution:
s = X.std(level=0)
print (s)
name
1260229 1.060660
1126191 0.577350
1065670 2.516611
Name: rank, dtype: float64
s = X.std(level=0, ddof=1)
print (s)
name
1260229 1.060660
1126191 0.577350
1065670 2.516611
Name: rank, dtype: float64
And with ddof=0, which matches the NumPy result:
s = X.std(level=0, ddof=0)
print (s)
name
1260229 0.750000
1126191 0.471405
1065670 2.054805
Name: rank, dtype: float64
If you want to use groupby, that is also possible:
s = X.groupby(level=0, sort=False).std(ddof=0)
print (s)
name
1260229 0.750000
1126191 0.471405
1065670 2.054805
Name: rank, dtype: float64
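For a self-contained check, the sketch below rebuilds X by hand from the means shown in the question (the original df is not available, so the MultiIndex construction is an assumption) and compares both ddof settings using the groupby form. As far as I know, the level argument of Series.std is no longer available in recent pandas releases, so the groupby form is likely the more portable of the two.

```python
import pandas as pd

# Per-category mean ranks copied from the first output table in the question.
index = pd.MultiIndex.from_tuples(
    [(1260229, 9), (1260229, 18),
     (1126191, 5), (1126191, 17), (1126191, 23),
     (1065670, 33), (1065670, 39), (1065670, 41)],
    names=["name", "category"],
)
X = pd.Series([11.0, 9.5, 4.0, 3.0, 4.0, 3.0, 5.0, 8.0], index=index, name="rank")

# Group on the first index level; sort=False keeps the original order of names.
by_name = X.groupby(level="name", sort=False)
sample = by_name.std()            # pandas default, ddof=1
population = by_name.std(ddof=0)  # same divisor as numpy.std

# Side-by-side view of both results per individual.
comparison = pd.DataFrame({"ddof=1": sample, "ddof=0": population})
print(comparison.round(6))
```

The ddof=1 column should match the 1.06 / 0.58 / 2.51 values from the question, and the ddof=0 column the values computed by hand with np.std.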
Upvotes: 2