sds

Reputation: 60064

Python comparison ignoring nan

While nan == nan is always False, in many cases people want to treat them as equal, and this is enshrined in pandas.DataFrame.equals:

NaNs in the same location are considered equal.

Of course, I can write

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

However, this fails on containers like [float("nan")], and math.isnan raises TypeError on non-numbers (so the complexity increases).
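The two failure modes can be reproduced with a short sketch. The helper `error_name` is a hypothetical name introduced only for this illustration; `equalp` is the function from the question.

```python
import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

def error_name(x, y):
    """Return the name of the exception equalp raises, or None (illustrative helper)."""
    try:
        equalp(x, y)
    except TypeError as exc:
        return type(exc).__name__
    return None

# Scalars work as intended.
scalar_result = equalp(float("nan"), float("nan"))

# Two lists holding *distinct* nan objects compare unequal, so equalp
# falls through to math.isnan, which rejects a list.
container_error = error_name([float("nan")], [float("nan")])

# Unequal non-numbers hit the same math.isnan call.
string_error = error_name("a", "b")
```

Note that a list compared with itself, or two lists sharing the same nan object, would compare equal, because list comparison checks identity before equality.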

So, what do people do to compare complex Python objects which may contain nan?

PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise.

PPS. When I say "compare", I mean producing a diff, not just an equality predicate like equalp.

Upvotes: 17

Views: 22951

Answers (3)

Shaun Taylor

Reputation: 326

Here's a function that recurses into a data structure replacing nan values with a unique string. I wrote this for a unit test that compares data structures that may contain nan.

It's only designed for data structures made of dict and list, but it's easy to see how to expand it.

from math import isnan
from uuid import uuid4
from typing import Union

NAN_REPLACEMENT = f"THIS_WAS_A_NAN{uuid4()}"

def replace_nans(data_structure: Union[dict, list]) -> Union[dict, list]:
    if isinstance(data_structure, dict):
        iterme = data_structure.items()
    elif isinstance(data_structure, list):
        iterme = enumerate(data_structure)
    else:
        raise ValueError(
            "replace_nans should only be called on structures made of dicts and lists"
        )

    for key, value in iterme:
        if isinstance(value, float) and isnan(value):
            data_structure[key] = NAN_REPLACEMENT
        elif isinstance(value, (dict, list)):
            # Recurse into nested containers (they are modified in place).
            data_structure[key] = replace_nans(value)
    return data_structure
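Note that the function above modifies its argument in place. If a diff is wanted rather than a yes/no answer, one possible non-mutating variant is sketched below. The names `is_nan` and `nan_aware_diff` are hypothetical, and like the function above it assumes structures made only of dicts, lists and scalars.

```python
from math import isnan

def is_nan(value):
    """True only for float nan; non-numbers are never passed to isnan."""
    return isinstance(value, float) and isnan(value)

def nan_aware_diff(left, right, path=()):
    """Yield (path, left_value, right_value) for each position that differs.

    Two nans at the same position count as equal.
    """
    if isinstance(left, dict) and isinstance(right, dict):
        for key in sorted(set(left) | set(right), key=repr):
            if key not in left or key not in right:
                # A missing key is reported with None on the missing side.
                yield path + (key,), left.get(key), right.get(key)
            else:
                yield from nan_aware_diff(left[key], right[key], path + (key,))
    elif isinstance(left, list) and isinstance(right, list):
        if len(left) != len(right):
            # Lists of different length are reported whole.
            yield path, left, right
        else:
            for index, (l_item, r_item) in enumerate(zip(left, right)):
                yield from nan_aware_diff(l_item, r_item, path + (index,))
    elif is_nan(left) and is_nan(right):
        return
    elif left != right:
        yield path, left, right

a = {"x": [1.0, float("nan")], "y": 2}
b = {"x": [1.0, float("nan")], "y": 3}
differences = list(nan_aware_diff(a, b))
```

Here `differences` contains a single entry for key "y", while the two nans under "x" are treated as equal.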

Upvotes: 0

juanpa.arrivillaga

Reputation: 96277

Suppose you have a data-frame with nan values:

In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])

In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)

In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0

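Say you want to compare two rows, for example rows 0 and 8. Then use fillna and do a vectorized comparison. Note that this treats NaN as equal to a real 0, so pick a fill value that cannot occur in the data: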

In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool

You can use the resulting boolean array to index into the columns, if you just want to know which columns are different:

In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
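If no safe fill value exists, an alternative is to mask out the positions where both rows are NaN instead of filling. This is a minimal sketch on two small hand-made rows (`row_a` and `row_b` are illustrative, not taken from the frame above); it uses only `Series.isna` and element-wise comparison.

```python
import numpy as np
import pandas as pd

cols = ["c0", "c1", "c2", "c3"]
row_a = pd.Series([np.nan, 0.0, 14.0, np.nan], index=cols)
row_b = pd.Series([np.nan, np.nan, 14.0, 3.0], index=cols)

# Positions where both rows are NaN should count as equal.
both_nan = row_a.isna() & row_b.isna()

# NaN != anything is True, so drop the both-NaN positions afterwards.
differs = (row_a != row_b) & ~both_nan
different_columns = list(row_a.index[differs])

# For contrast: filling with 0 hides the 0.0-versus-NaN mismatch in c1.
fillna_columns = list(row_a.index[row_a.fillna(0) != row_b.fillna(0)])
```

With these rows the mask-based comparison reports c1 and c3, whereas the fillna(0) comparison reports only c3.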

Upvotes: 10

ascripter

Reputation: 6233

I assume you have array-data or can at least convert to a numpy array?

One way is to mask all the nans using a numpy.ma masked array and then compare the arrays. So your starting situation would be something like this:

import numpy as np
import numpy.ma as ma
arr1 = ma.array([3,4,6,np.nan,2])
arr2 = ma.array([3,4,6,np.nan,2])

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [ True  True  True False  True]
>>> False  # <-- you want this to show True

Solution:

arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [True True True -- True]
>>> True
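The masking step can also be written with `numpy.ma.masked_invalid`, which masks NaN (and infinite) entries when the array is created. The sketch below additionally compares the masks, on the assumption that a NaN present in only one array should count as a difference; `ma.all` skips masked positions, so it would not catch that case by itself. The array `arr3` is an extra illustrative input.

```python
import numpy as np
import numpy.ma as ma

# masked_invalid masks NaN and inf entries in one step.
arr1 = ma.masked_invalid([3, 4, 6, np.nan, 2])
arr2 = ma.masked_invalid([3, 4, 6, np.nan, 2])
arr3 = ma.masked_invalid([3, 4, 6, 7, 2])  # has a real value where arr1 has NaN

def masks_match(a, b):
    """True if both arrays are masked at exactly the same positions."""
    return bool(np.array_equal(ma.getmaskarray(a), ma.getmaskarray(b)))

# Unmasked values agree and the NaNs sit in the same places.
values_equal = bool(ma.all(arr1 == arr2))
same_nan_positions = masks_match(arr1, arr2)

# arr1 and arr3 disagree about where the NaNs are.
nan_mismatch = not masks_match(arr1, arr3)
```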

Upvotes: 4
