Reputation: 323
I am workint in Python 2.7 and I have a data frame and I want to get the average of the column called 'c', but only the rows that verify that the values in another column are equal to some value. When I execute the code, the answer is unexpected, but when I execute the calculation, calculating the median, the result is correct.
Why is the output of the mean incorrect?
The code is the following:
df = pd.DataFrame(
np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]]),
columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()
The output:
df
Out[1]:
a b c d
0 A 1 2 3
1 A 4 5 nan
2 A 7 8 9
3 B 3 2 nan
4 B 5 6 nan
5 B 5 6 nan
mean1
Out[2]: 86.0
mean2
Out[3]: 88.66666666666667
median1
Out[4]: 5.0
median2
Out[5]: 6.0
It is obvious that the output of the mean is incorrect.
Thanks.
Upvotes: 3
Views: 2499
Reputation: 51155
Pandas is doing string concatenation for the "sum" when calculating the mean, this is plain to see from your example frame.
>>> df[df.a == 'B'].c
3 2
4 6
5 6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667
If you look at the dtype
's for your DataFrame, you'll notice that all of them are object
, even though no single Series
contains mixed types. This is due to the declaration of your numpy
array. Arrays are not meant to contain heterogenous types, so the array defaults to dtype object
, which is then passed to the DataFrame constructor. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtype
's with no issues.
df = pd.DataFrame(
[['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]],
columns=['a', 'b', 'c', 'd']
)
df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a object
b int64
c int64
d float64
dtype: object
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page, but in general, you shouldn't be using object
dtype Series for numeric calculations.
Upvotes: 6