David
David

Reputation: 1267

Pandas selecting row by column value, strange behaviour

Ok, I have a pandas dataframe like this:

         lat      long    level        date    time    value
3341  29.232   -15.652     10.0  20100109.0   700.0      0.5
3342  27.887   -13.668    120.0  20100109.0   700.0      3.2
...
3899  26.345   -11.234      0.0  20100109.0   700.0      5.8

The reason of the strange number of the index is because it comes from a csv converted to pandas dataframe with some values filtered. Columns level, date, time are not really relevant.

I am trying, in ipython, to see the some rows filtering by latitude, so I do (if the dataframe is c):

c[c['lat'] == 26.345]

or

c.loc[c['lat'] == 26.345]

and I can see if the value is present or not, but sometimes it outputs nothing for latitude values that I am seeing in the dataframe !?! (For instance, I can see in the dataframe the value of latitude 27.702 and when I do c[c['lat'] == 27.702] or c.loc[c['lat'] == 27.702] I get an empty dataframe and I am seeing the value for such latitude). What is happening here?

Thank you.

Upvotes: 2

Views: 484

Answers (3)

Jonathan Eunice
Jonathan Eunice

Reputation: 22473

This is probably because you are asking for an exact match against floating point values, which is very, very dangerous. They are approximations, often printed to less precision than actually stored.

It's very easy to see 0.735471 printed, say, and think that's all there is, when in fact the value is really 0.73547122072282867; the display function has simply truncated the result. But when you try a strict equality test on the attractively short value, boom. Doesn't work.

Instead of

c[c['lat'] == 26.345]

Try:

import numpy as np

c[np.isclose(c['lat'], 26.345)]

Now you'll get values that are within a certain range of the value you specified. You can set the tolerance.

Upvotes: 4

3novak
3novak

Reputation: 2544

In addition to the answers addressing comparison on floating point values, some of the values in your lat column may be string type instead of numeric.

EDIT: You indicated that this is not the problem, but I'll leave this response here in case it helps someone else. :)

Use the to_numeric() function from pandas to convert them to numeric.

import pandas as pd

df['lat'] = pd.to_numeric(df['lat'])
# you can adjust the errors parameter as you need
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')

Upvotes: 2

Ilya V. Schurov
Ilya V. Schurov

Reputation: 8067

It is a bit difficult to give a precise answer, as the question does not contain reproducible example, but let me try. Most probably, this is due floating point issues. It is possible that the number you see (and try to compare with) is not the same number that is stored in the memory due to rounding. For example:

import numpy as np
x = 0.1
arr = np.array([x + x + x])
print(np.array([x + x + x]))
# [ 0.3]
print(arr[arr == 0.3])
# []
print(x + x + x)
# 0.30000000000000004
# in fact 0.1 is not exactly equal to 1/10, 
# so 0.1 + 0.1 + 0.1 is not equal to 0.3

You can overcome this issue using np.isclose instead of ==:

print(np.isclose(arr, 0.3))
# [ True]
print(arr[np.isclose(arr, 0.3)])
# [ 0.3]

Upvotes: 2

Related Questions