CorgiGeek
CorgiGeek

Reputation: 15

Pandas numpy handling Nan

I have a pandas dataframe where I'm using the numpy library to determine whether the values in two columns are the same. I run into problems with the values in these columns being blank/NaN and the dataframe reporting that they are differnet although they both show as NaN. What is the best way to handle NaN when doing this analysis?

I have two files that I load into dataframes and then I merge the two on a unique key into the mathced dataframe and then do the following to find differences:

merged = pd.merge(left= df_price_source, right=df_price_secondary, on=['ACCT'], indicator=True, how='inner')
matched = merged[merged['_merge'] == 'both'].copy()
matched['date_comparison'] = np.where(matched["Start Date_x"] == matched["Start Date_y"], True, False)
matched['profit_comparison'] = np.where(matched["profit_x"] == matched["profit_y"], True, False)

For these new columns matched['date_comparison'] and matched[profit_comparison'] I see False in NaN columns but I would expect True.

Upvotes: 0

Views: 90

Answers (1)

richie
richie

Reputation: 181

Looking up numpy.NaN i found this SO answer.

the condition in the where method results in False when comparing 2 NaN

numpy.NaN == numpy.NaN
> False

to check if both are NaN you can use:

numpy.NaN is numpy.NaN
> True

you need to add the is check to the condition to recognize 2 NaN:

(matched["Start Date_x"] == matched["Start Date_y"]) or \
(matched["Start Date_x"] is matched["Start Date_y"])

you can also use a Set to check equality instead of the 2 conditions, as it works with NaN also:

 len({np.NaN, np.NaN}) == 1
 > True
 len({matched["Start Date_x"], matched["Start Date_y"]}) == 1

Upvotes: 0

Related Questions