Brian Huey

Reputation: 1630

Inconsistent Nan Key Error using Pandas Apply

I'm recoding multiple columns in a dataframe and have come across a strange result that I can't quite figure out. I'm probably not recoding in the most efficient manner possible, but it's mostly the error that I'm hoping someone can explain.

import numpy as np
import pandas as pd

s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5'], columns=['col1'])
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
s1_dic = {np.nan: np.nan, '1': 1, '2': 2, '3': 3, '4': 3, '5': 3}
s2_dic = {np.nan: np.nan, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}
s1['col1'].apply(lambda x: s1_dic[x])  # works
s2['col1'].apply(lambda x: s2_dic[x])  # raises KeyError: nan

s1 works fine, but when I do the same thing with s2, which holds integers and an np.nan, I get KeyError: nan, which is confusing. Any help would be appreciated.

Upvotes: 3

Views: 2358

Answers (1)

Andy Hayden

Reputation: 375405

A workaround is to use the dict's get method rather than the lambda:

In [11]: s1['col1'].apply(s1_dic.get)
Out[11]:
0   NaN
1     1
2     2
3     3
4     3
5     3
Name: col1, dtype: float64

In [12]: s2['col1'].apply(s2_dic.get)
Out[12]:
0   NaN
1     1
2     2
3     3
4     3
5     3
Name: col1, dtype: float64

It's not clear to me right now why this is different...
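For what it's worth, part of the difference seems to be that .get returns None for a missing key instead of raising, and apply then shows that None as NaN. The sketch below assumes the nan that apply passes in for the float64 column is a different object from np.nan; `other_nan` is a hypothetical stand-in for that value, not something taken from pandas internals.

```python
import numpy as np

s2_dic = {np.nan: np.nan, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

# Hypothetical stand-in for a nan that is NOT the np.nan object itself.
other_nan = np.float64("nan")

# .get falls back to its default (None) when the key is not found.
missing = s2_dic.get(other_nan)

# Plain indexing with the same key raises instead.
try:
    s2_dic[other_nan]
    raised = False
except KeyError:
    raised = True
```

So with .get the nan row is probably not being looked up successfully at all; it just fails quietly.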


Note: the dicts can be accessed by nan:

In [21]: s1_dic[np.nan]
Out[21]: nan

In [22]: s2_dic[np.nan]
Out[22]: nan

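and hash(np.nan) == 0 (on Python versions before 3.10; from 3.10 on, the hash of a nan depends on the object's identity), so it's not that...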


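Update: Apparently the issue is np.nan vs np.float64(np.nan). np.nan is np.nan is True (because np.nan is bound to one specific nan object), whilst float('nan') is not float('nan'). A dict lookup matches a key if it is the same object or compares equal, and nan never compares equal to anything, itself included, so a nan key is only found when you look it up with the very same object.

This means that a dict treats each separately created nan as a distinct key: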

In [21]: nans = [float('nan') for _ in range(5)]

In [22]: {f: 1 for f in nans}
Out[22]: {nan: 1, nan: 1, nan: 1, nan: 1, nan: 1}

This means you can't reliably retrieve the nans from a dict; any such retrieval would be implementation specific! In fact, as the dict relies on the identity (id) of these nans, the entire behavior above may be implementation specific (for example, if the nans happened to share the same id, as they may in a REPL/IPython session).
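A minimal sketch of that identity-based lookup, using plain Python floats (the names `a`, `b` and `d` are just for illustration):

```python
a = float("nan")
b = float("nan")  # a second, separate nan object

d = {a: "found"}

# Looking up with the very same object succeeds (identity is checked first).
same_object = d.get(a)

# Looking up with a different nan fails: not identical, and nan != nan.
other_object = d.get(b)

nan_equals_itself = (a == a)
```

Here `same_object` is "found" while `other_object` is None, even though both keys print as nan.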

You can catch the nullness beforehand:

In [31]: s2['col1'].apply(lambda x: s2_dic[x] if pd.notnull(x) else x)
Out[31]:
0   NaN
1     1
2     2
3     3
4     3
5     3
Name: col1, dtype: float64

But I think the original suggestion of using .get is a better option.
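Another option, not discussed above, is Series.map with the dict passed directly. As far as I know, map leaves values without a matching key as NaN, so the nan entry can be dropped from the dict entirely. A sketch under that assumption (the `recode` dict is a trimmed copy of s2_dic):

```python
import numpy as np
import pandas as pd

s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=["col1"])

# Same recoding as s2_dic, minus the nan key.
recode = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

# Values with no matching key (here, the nan row) come back as NaN.
recoded = s2["col1"].map(recode)
```

This sidesteps the question of which nan object is used as a key, since no nan lookup is needed.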

Upvotes: 2
