Munichong

Reputation: 4041

Selection of a Series in Python Pandas

I am trying to select rows of a pandas Series with a boolean comparison, but I ran into the following problem:

>>> q=pandas.Series([0.5,0.5,0,1,0.5,0.5])
>>> q
0    0.5
1    0.5
2    0.0
3    1.0
4    0.5
5    0.5
dtype: float64

>>> (q-0.3).abs()
0    0.2
1    0.2
2    0.3
3    0.7
4    0.2
5    0.2
dtype: float64

>>> (q-0.7).abs()
0    0.2
1    0.2
2    0.7
3    0.3
4    0.2
5    0.2
dtype: float64

>>> (q-0.3).abs() > (q-0.7).abs()          # This is what I expected:
0     True                                 # False
1     True                                 # False
2    False                                 # False
3     True                                 # True
4     True                                 # False
5     True                                 # False
dtype: bool

>>> (q-0.3).abs() == (q-0.7).abs()
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

Clearly, 0.2 is not greater than 0.2, yet the comparison returns True and the equality test returns False.

Why is the output different from what I expect?

Upvotes: 1

Views: 159

Answers (2)

Andy Hayden

Reputation: 375925

Andy's answer is spot on about the cause: this is a floating point issue, compounded by the way pandas truncates floats when printing a Series/DataFrame.

You might like to use the numpy function isclose:

In [11]: a = (q-0.3).abs()

In [12]: b = (q-0.7).abs()

In [13]: np.isclose(a, b)
Out[13]: array([ True,  True, False, False,  True,  True], dtype=bool)

I don't think there's a native pandas function to do this, happy to be called out on that...

np.isclose has a default absolute tolerance (atol) of 1e-8, alongside a relative tolerance (rtol) of 1e-5, so it may make sense to use the same 1e-8 margin when testing greater than (to get your desired result):

In [14]: a > b + 1e-8
Out[14]:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool
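
An alternative to adding a fixed margin, sketched here under the assumption that "greater than" should mean "greater and not approximately equal", is to combine the plain comparison with np.isclose. The names close and strictly_greater are illustrative only:

```python
import numpy as np
import pandas as pd

q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5])
a = (q - 0.3).abs()
b = (q - 0.7).abs()

# np.isclose returns a plain ndarray, so wrap it to keep the Series index.
close = pd.Series(np.isclose(a, b), index=q.index)

# Keep only the rows where a is larger AND the two values are not
# approximately equal (default tolerances: rtol=1e-5, atol=1e-8).
strictly_greater = (a > b) & ~close
print(strictly_greater.tolist())
# [False, False, False, True, False, False]
```

This reuses isclose's tolerances instead of hard-coding 1e-8, which may be preferable if the data is not all on the same scale.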

Update: On the performance aspect, float64 is roughly 1000 times faster than Decimal for a Series with 6000 elements (and the gap widens as the length increases):

In [21]: q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5] * 1000)

In [22]: %timeit a = (q-0.3).abs(); b = (q-0.7).abs(); a > b + 1e-8
1000 loops, best of 3: 726 µs per loop

In [23]: dec_s = q.apply(Decimal)

In [24]: %timeit (dec_s-Decimal(0.3)).abs() > (dec_s-Decimal(0.7)).abs()
1 loops, best of 3: 915 ms per loop

The difference is even starker with more elements:

In [31]: q = pd.Series([0.5, 0.5 ,0, 1, 0.5, 0.5] * 10000)

In [32]: %timeit a = (q-0.3).abs(); b = (q-0.7).abs(); a > b + 1e-8
1000 loops, best of 3: 1.5 ms per loop

In [33]: dec_s = q.apply(Decimal)

In [34]: %timeit (dec_s-Decimal(0.3)).abs() > (dec_s-Decimal(0.7)).abs()
1 loops, best of 3: 9.16 s per loop
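
Outside IPython, a comparison along the same lines can be sketched with the standard timeit module. This is only a rough sketch: the helper names float_version and decimal_version are made up for illustration, the Decimals are built from strings rather than floats, and the timings will vary by machine and pandas version, so no numbers are claimed here:

```python
import timeit
from decimal import Decimal

import pandas as pd

q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5] * 1000)
# Build the Decimal Series from strings so each value is exact.
dec_s = pd.Series([Decimal(str(x)) for x in q])


def float_version():
    # float64 comparison with a small tolerance margin
    a = (q - 0.3).abs()
    b = (q - 0.7).abs()
    return a > b + 1e-8


def decimal_version():
    # object-dtype comparison; every element is a Python Decimal
    a = (dec_s - Decimal("0.3")).map(abs)
    b = (dec_s - Decimal("0.7")).map(abs)
    return a > b


t_float = timeit.timeit(float_version, number=5)
t_decimal = timeit.timeit(decimal_version, number=5)
print(f"float64: {t_float:.4f} s, Decimal: {t_decimal:.4f} s (5 runs each)")
```

Both functions should produce the same boolean result; only the running time differs.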

Upvotes: 1

Andy

Reputation: 50640

This is a floating point problem. It is described very well in this question.

To directly answer your problem, look at the element at index 1 in each of your two tests. The values are not equal.

>>> (q-0.7).abs()[1]
0.19999999999999996
>>> (q-0.3).abs()[1]
0.20000000000000001
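
The same thing can be seen in plain Python without pandas. As a small sketch (the names a and b are illustrative), formatting both values with 17 significant digits shows the digits that the default Series display hides:

```python
a = abs(0.5 - 0.3)
b = abs(0.5 - 0.7)

# 17 significant digits are enough to distinguish any two float64 values.
print(format(a, ".17g"))  # 0.20000000000000001
print(format(b, ".17g"))  # 0.19999999999999996

print(a > b)   # True
print(a == b)  # False
```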

We can get your results though, with a little bit of manipulation and by utilizing the decimal module.

>>> from decimal import Decimal, getcontext
>>> import pandas
>>> s = [0.5,0.5,0,1,0.5,0.5]
>>> dec_s = [Decimal(x) for x in s]
>>> q = pandas.Series(dec_s)
>>> q
0    0.5
1    0.5
2      0
3      1
4    0.5
5    0.5
dtype: object
>>> getcontext().prec
28
>>> getcontext().prec = 2
>>> (q-Decimal(0.3)).abs() > (q-Decimal(0.7)).abs()
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

A few things to note:

  • The list of values is converted from float to decimal data types before being added to the Series.
  • The dtype is now object instead of float64. This is because numpy doesn't handle Decimal types directly.
  • The default precision of the decimal context is 28 significant digits (not 28 places after the decimal point). I've reduced it to 2. Decimal(0.3) and Decimal(0.7) are built from floats, so they carry the floats' full binary expansion, and at the default precision the results come out as long float-like numbers. The smaller precision matches your data set.
  • The 0.3 and 0.7 values used in the comparison must also be Decimals, otherwise you will see an error similar to unsupported operand type(s) for -: 'Decimal' and 'float'.
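
A variation on the above, offered as a sketch rather than a replacement: if the Decimals are constructed from strings instead of floats, every value is exact and the context precision can be left at its default. The variable names are illustrative, and .map(abs) is used here as a conservative choice for an object-dtype Series:

```python
from decimal import Decimal

import pandas as pd

# Strings avoid inheriting the binary rounding error of the float literals.
values = ["0.5", "0.5", "0", "1", "0.5", "0.5"]
q = pd.Series([Decimal(v) for v in values])

a = (q - Decimal("0.3")).map(abs)
b = (q - Decimal("0.7")).map(abs)

greater = a > b
equal = a == b
print(greater.tolist())  # [False, False, False, True, False, False]
print(equal.tolist())    # [True, True, False, False, True, True]
```

The trade-off is the same as before: the Series has dtype object, so the arithmetic runs element by element in Python.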

Upvotes: 1
