albert

Reputation: 8613

pandas: checking whether an element is in a DataFrame or a given column gives strange results

I am doing some data handling on a DataFrame of shape (135150, 12), so double-checking my results manually is no longer feasible.

I encountered some 'strange' behavior when I tried to check whether an element is part of the DataFrame or of a given column.

This behavior is reproducible even with a much smaller DataFrame:

import numpy as np
import pandas as pd    

start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)

val = 0.0019

df = pd.DataFrame(arr, columns=['example_value'])

print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`

Since I am a complete beginner in data analysis, I am not able to explain this behavior.

I know that these approaches involve different data types (pd.DataFrame, pd.Series and np.ndarray) when checking whether the given value exists in the DataFrame. I am also aware that machine precision comes into play when using NumPy arrays.

In the end, however, I need to implement several functions that filter the DataFrame and count the occurrences of some values. I have done this successfully several times before using boolean columns built with comparisons such as > and <.

In this case, though, I need to filter by the exact value and count its occurrences, which is what led me to the issue described above.

So could anyone explain what's going on here?

Upvotes: 2

Views: 977

Answers (1)

Randy

Reputation: 14847

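The underlying issue, as Divakar suggested, is floating point precision: the value that np.arange produces near 0.0019 is not bit-for-bit identical to the literal 0.0019, so exact equality checks fail. Because DataFrames and Series are built on top of NumPy, there isn't really a penalty for using NumPy functions on them, so you can compare with a tolerance instead: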

df['example_value'].apply(lambda x: np.isclose(x, val)).any()

or

np.isclose(df['example_value'], val).any()

both of which correctly return True.
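
Since the question also asks about filtering and counting, here is a minimal sketch of how the same tolerance-based comparison could be used as a boolean mask. The names mask, matches and count are only illustrative, and the default tolerances of np.isclose (rtol=1e-05, atol=1e-08) are an assumption that happens to suit this data, where neighbouring values are 1e-5 apart; other data may need different tolerances. The first two checks are a reminder that `in` on a DataFrame tests the column labels and `in` on a Series tests the index, so neither looks at the stored values.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
start, end, step = 1e-3, 2e-3, 0.01e-3
arr = np.arange(start, end + step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])

# `in` checks labels, not values:
label_in_df = 'example_value' in df       # column label of the DataFrame
label_in_series = 0 in df['example_value']  # index label of the Series
print(label_in_df, label_in_series)

# Tolerance-based boolean mask over the column's values
mask = np.isclose(df['example_value'].to_numpy(), val)

matches = df[mask]       # rows whose value is approximately `val`
count = int(mask.sum())  # number of such rows

print(matches)
print(count)
```

If the tolerance is too loose for your data you can pass rtol and atol to np.isclose explicitly.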

Upvotes: 3
