Reputation: 1630
I'm recoding multiple columns in a dataframe and have come across a strange result that I can't quite figure out. I'm probably not recoding in the most efficient manner possible, but it's mostly the error that I'm hoping someone can explain.
import numpy as np
import pandas as pd
s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5'], columns=['col1'])
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
s1_dic = {np.nan: np.nan, '1': 1, '2': 2, '3': 3, '4': 3, '5': 3}
s2_dic = {np.nan: np.nan, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}
s1['col1'].apply(lambda x: s1_dic[x])
s2['col1'].apply(lambda x: s2_dic[x])
s1 works fine, but when I try the same thing with s2 (integers plus an np.nan), I get KeyError: nan, which is confusing. Any help would be appreciated.
Upvotes: 3
Views: 2358
Reputation: 375405
A workaround is to pass the dict's get method directly, rather than indexing inside a lambda:
In [11]: s1['col1'].apply(s1_dic.get)
Out[11]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
In [12]: s2['col1'].apply(s2_dic.get)
Out[12]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
It's not clear to me right now why this is different...
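One difference that can be shown without pandas: indexing with [] raises on a missing key, while get returns None. The sketch below is plain Python and assumes that apply hands the function a NaN that is a different object from the dict's NaN key; float('nan') stands in for np.nan, and the names key_nan, fresh_nan, via_get and via_float are made up for illustration.

```python
# Plain-Python stand-in for s2_dic; float('nan') plays the role of np.nan.
key_nan = float('nan')
s2_dic = {key_nan: key_nan, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

# Assumption: the NaN seen inside apply is a different object from the key.
fresh_nan = float('nan')

# .get returns None for a key it cannot find instead of raising.
via_get = s2_dic.get(fresh_nan)

# Indexing with [] raises KeyError for the same key.
try:
    s2_dic[fresh_nan]
    raised = False
except KeyError:
    raised = True

# Float values still match the integer keys, because 1.0 == 1 and
# both hash to the same value.
via_float = s2_dic[1.0]
```

If that assumption holds, the NaN in the get output would come from the None being displayed as a missing value, not from a successful lookup.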
Note: the dicts can be accessed by nan:
In [21]: s1_dic[np.nan]
Out[21]: nan
In [22]: s2_dic[np.nan]
Out[22]: nan
and hash(np.nan) == 0 (true on Python versions before 3.10; later versions hash a NaN by object identity), so it's not that...
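Update: Apparently the issue is np.nan vs np.float64(np.nan). For the former, np.nan is np.nan holds (because np.nan is bound to one specific instantiated nan object), whilst float('nan') is not float('nan'). A dict lookup matches a key by identity first and then by equality, and a nan never compares equal to anything, itself included, so a nan key is only found when the very same object is passed in.
This means that get won't find float('nan'), and every new nan becomes a separate key: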
In [21]: nans = [float('nan') for _ in range(5)]
In [22]: {f: 1 for f in nans}
Out[22]: {nan: 1, nan: 1, nan: 1, nan: 1, nan: 1}
This means you can't reliably retrieve these nans from a dict by value; any such retrieval would be implementation specific! In fact, as the dict relies on the identity (id) of these nans, the entire behavior above may be implementation specific (for example if the nans happened to share the same id, as they may in a REPL/IPython session).
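A minimal plain-Python sketch of the identity point, assuming CPython: the same nan object is found again, a newly created nan is not, and scanning the keys with math.isnan is one way to locate nan keys regardless of identity. The names first, second and lookup are made up for illustration.

```python
import math

first = float('nan')
second = float('nan')

# Two distinct nan objects become two separate keys.
lookup = {first: 'first', second: 'second'}
n_keys = len(lookup)

# Passing the very same object back in succeeds (identity match).
same_object = lookup.get(first)

# A newly created nan is neither identical nor equal to any key.
new_object = lookup.get(float('nan'))

# Scanning the keys does not depend on identity.
nan_keys = [k for k in lookup if math.isnan(k)]
```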
You can catch the nullness beforehand:
In [31]: s2['col1'].apply(lambda x: s2_dic[x] if pd.notnull(x) else x)
Out[31]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
But I think the original suggestion of using .get is a better option.
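One trade-off worth noting: get also returns None for any value that is simply absent from the dict, so a typo in the codes would pass silently. A possible middle ground, sketched in plain Python, is a small helper that passes nan through but still raises KeyError for unknown codes. The name recode is hypothetical and not part of pandas; it could presumably be used as s2['col1'].apply(lambda x: recode(x, s2_dic)).

```python
import math

def recode(value, mapping):
    """Look up value in mapping, passing nan through unchanged (hypothetical helper)."""
    # np.float64 subclasses float, so this check covers both kinds of nan.
    if isinstance(value, float) and math.isnan(value):
        return value
    # Unknown codes still raise KeyError, unlike mapping.get.
    return mapping[value]

# The nan entry is no longer needed in the dict.
s2_dic = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}
values = [float('nan'), 1.0, 2.0, 3.0, 4.0, 5.0]
recoded = [recode(v, s2_dic) for v in values]
```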
Upvotes: 2