Reputation: 4041
I am using a pandas Series to select rows of another Series, but I ran into the following problem:
>>> q=pandas.Series([0.5,0.5,0,1,0.5,0.5])
>>> q
0 0.5
1 0.5
2 0.0
3 1.0
4 0.5
5 0.5
dtype: float64
>>> (q-0.3).abs()
0 0.2
1 0.2
2 0.3
3 0.7
4 0.2
5 0.2
dtype: float64
>>> (q-0.7).abs()
0 0.2
1 0.2
2 0.7
3 0.3
4 0.2
5 0.2
dtype: float64
>>> (q-0.3).abs() > (q-0.7).abs() # This is what I expected:
0 True # False
1 True # False
2 False # False
3 True # True
4 True # False
5 True # False
dtype: bool
>>> (q-0.3).abs() == (q-0.7).abs()
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
Apparently, "0.2" is not greater than "0.2"......
Why is the output different from what I expect?
Upvotes: 1
Views: 159
Reputation: 375925
Andy's answer is spot on for the reason (this is a floating point issue, and also an issue of how pandas truncates floating points when printing in a Series/DataFrame...).
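To see the digits that the default Series display hides, one option is to raise the display precision temporarily. This is only a small sketch, assuming a reasonably recent pandas; the names diff_low and diff_high are mine, not from the question:

```python
import pandas as pd

q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5])
diff_low = (q - 0.3).abs()
diff_high = (q - 0.7).abs()

# Temporarily show up to 17 decimal places instead of the default 6,
# so the tiny difference between the two results becomes visible.
with pd.option_context("display.precision", 17):
    print(diff_low)
    print(diff_high)
```

The option only changes how the values are printed; the stored floats are the same either way.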
You might like to use the numpy function isclose:
In [11]: a = (q-0.3).abs()
In [12]: b = (q-0.7).abs()
In [13]: np.isclose(a, b)
Out[13]: array([ True, True, False, False, True, True], dtype=bool)
I don't think there's a native pandas function to do this, happy to be called out on that...
This has a default absolute tolerance (atol) of 1e-8, so it may make sense for us to use that when testing greater than (to get your desired result):
In [14]: a > b + 1e-8
Out[14]:
0 False
1 False
2 False
3 True
4 False
5 False
dtype: bool
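A variation on the same idea, in case you'd rather not hard-code 1e-8: combine the plain comparison with np.isclose, so that pairs isclose considers equal never count as greater. This is just a sketch; close and strictly_greater are names I made up, and note that isclose also applies a relative tolerance (rtol, default 1e-5), so it is slightly more lenient than adding 1e-8:

```python
import numpy as np
import pandas as pd

q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5])
a = (q - 0.3).abs()
b = (q - 0.7).abs()

# Wrap the isclose result in a Series so it lines up with a and b by index.
close = pd.Series(np.isclose(a, b), index=a.index)

# Greater than, but only where the two values are not approximately equal.
strictly_greater = (a > b) & ~close
print(strictly_greater.tolist())
# [False, False, False, True, False, False]
```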
Update: Just to comment further on the performance aspect, float64 is roughly 1000 times faster than Decimal for a Series with 6000 elements (and the gap widens as the length increases):
In [21]: q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5] * 1000)
In [22]: %timeit a = (q-0.3).abs(); b = (q-0.7).abs(); a > b + 1e-8
1000 loops, best of 3: 726 µs per loop
In [23]: dec_s = q.apply(Decimal)
In [24]: %timeit (dec_s-Decimal(0.3)).abs() > (dec_s-Decimal(0.7)).abs()
1 loops, best of 3: 915 ms per loop
The difference is even starker with more elements:
In [31]: q = pd.Series([0.5, 0.5 ,0, 1, 0.5, 0.5] * 10000)
In [32]: %timeit a = (q-0.3).abs(); b = (q-0.7).abs(); a > b + 1e-8
1000 loops, best of 3: 1.5 ms per loop
In [33]: dec_s = q.apply(Decimal)
In [34]: %timeit (dec_s-Decimal(0.3)).abs() > (dec_s-Decimal(0.7)).abs()
1 loops, best of 3: 9.16 s per loop
Upvotes: 1
Reputation: 50640
This is a floating point problem. It is described very well in this question.
To directly answer your problem, look at a single element (index 1 here, which holds the same value as index 0) of your two tests. Your values are not equal.
>>> (q-0.7).abs()[1]
0.19999999999999996
>>> (q-0.3).abs()[1]
0.20000000000000001
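The same thing can be reproduced without pandas at all. Here is a minimal sketch with plain Python floats (x and y are just illustrative names), which also prints the exact binary values that the literals 0.3 and 0.7 are stored as:

```python
from decimal import Decimal

x = abs(0.5 - 0.3)
y = abs(0.5 - 0.7)

print(repr(x), repr(y))  # 0.2 0.19999999999999996
print(x == y)            # False
print(x > y)             # True

# Decimal(float) converts the float exactly, so these show that neither
# 0.3 nor 0.7 is stored as the decimal number you typed.
print(Decimal(0.3))
print(Decimal(0.7))
```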
We can get your results though, with a little bit of manipulation and by utilizing the decimal module.
>>> from decimal import Decimal, getcontext
>>> import pandas
>>> s = [0.5,0.5,0,1,0.5,0.5]
>>> dec_s = [Decimal(x) for x in s]
>>> q = pandas.Series(dec_s)
>>> q
0 0.5
1 0.5
2 0
3 1
4 0.5
5 0.5
dtype: object
>>> getcontext().prec
28
>>> getcontext().prec = 2
>>> (q-Decimal(0.3)).abs() > (q-Decimal(0.7)).abs()
0 False
1 False
2 False
3 True
4 False
5 False
dtype: bool
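As an aside on why the precision is lowered in the session above: Decimal(0.3) copies the binary float exactly, error included, so at the default precision the comparison still comes out True for 0.5. The sketch below (the helper closer_to_high is hypothetical, written only for illustration) compares that with rounding to 2 digits and with building the Decimals from strings, which is an alternative I'm swapping in, not what the session above does:

```python
from decimal import Decimal, localcontext

def closer_to_high(value, low, high, prec):
    """Return True if value is strictly farther from low than from high,
    with all arithmetic done at the given decimal precision."""
    with localcontext() as ctx:
        ctx.prec = prec
        return abs(value - low) > abs(value - high)

half = Decimal("0.5")

# Built from floats at 28 digits: the float error is inherited.
print(closer_to_high(half, Decimal(0.3), Decimal(0.7), prec=28))      # True

# Built from floats at 2 digits: rounding hides the error.
print(closer_to_high(half, Decimal(0.3), Decimal(0.7), prec=2))       # False

# Built from strings: exactly 0.3 and 0.7, no rounding trick needed.
print(closer_to_high(half, Decimal("0.3"), Decimal("0.7"), prec=28))  # False
```

Using localcontext keeps the precision change local to the helper instead of changing it globally with getcontext().prec.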
A few things to note:
The values are converted from float to decimal data types before being added to the Series.
The dtype is now object instead of float64. This is because numpy doesn't handle Decimal types directly.
The 0.3 and 0.7 values used in the comparison must also be Decimals, otherwise you will see an error similar to unsupported operand type(s) for +: 'Decimal' and 'float'.
Upvotes: 1