Reputation: 1737
I'm trying to take the min and max of a couple of Pandas Series objects containing datetime64 data in the face of NaT. np.minimum and np.maximum work the way I want if the dtype is float64. That is, if either element in a pairwise comparison is NaN, NaN is the result of that comparison. For example:
>>> s1
0 0.0
1 1.8
2 3.6
3 5.4
dtype: float64
>>> s2
0 10.0
1 17.0
2 NaN
3 14.0
dtype: float64
>>> np.maximum(s1, s2)
0 10.0
1 17.0
2 NaN
3 14.0
dtype: float64
>>> np.minimum(s1, s2)
0 0.0
1 1.8
2 NaN
3 5.4
dtype: float64
This doesn't work if s1 and s2 are datetime64 objects:
>>> s1
0 2199-12-31
1 2199-12-31
2 2199-12-31
3 2199-12-31
dtype: datetime64[ns]
>>> s2
0 NaT
1 2018-10-30
2 NaT
3 NaT
dtype: datetime64[ns]
>>> np.maximum(s1, s2)
0 2199-12-31
1 2199-12-31
2 2199-12-31
3 2199-12-31
dtype: datetime64[ns]
>>> np.minimum(s1, s2)
0 2199-12-31
1 2018-10-30
2 2199-12-31
3 2199-12-31
dtype: datetime64[ns]
I expected indexes 0, 2 and 3 to turn up as NaT whether computing the min or max. (I realize numpy's functions might not have been the best choice, but I was not successful finding suitable Pandas analogs.)
After doing a bit of reading, I came to realize NaT is only approximately NaN, the latter having a proper floating point representation. Further reading suggested no simple way to have NaT "pollute" these comparisons. What's the correct way to get NaT to propagate in min/max comparisons the way NaN does in a floating point context? Maybe there are Pandas equivalents to numpy.{maximum,minimum} which are NaT-aware?
Upvotes: 5
Views: 1338
Reputation: 164673
pd.Series.mask seems like one solution which doesn't give up on vectorisation:
import numpy as np
import pandas as pd

s1 = pd.Series([pd.Timestamp(2099, 12, 31)] * 4)
s2 = pd.Series([pd.NaT, pd.Timestamp(2018, 10, 30), pd.NaT, pd.NaT])
null_check = s1.isnull() | s2.isnull()
res_max = np.maximum(s1, s2).mask(null_check, np.nan)
res_min = np.minimum(s1, s2).mask(null_check, np.nan)
print(res_max)
print(res_min)
0 NaT
1 2099-12-31
2 NaT
3 NaT
dtype: datetime64[ns]
0 NaT
1 2018-10-30
2 NaT
3 NaT
dtype: datetime64[ns]
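Equivalently, Series.where can express the same masking in one step (a small variation on the above, not part of the original answer; where with no second argument fills the masked rows with NaN, which becomes NaT in a datetime64 series):

res_max = np.maximum(s1, s2).where(~null_check)  # rows where either input is NaT become NaT
res_min = np.minimum(s1, s2).where(~null_check)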
As you found, the reason for the behaviour you see is that pd.NaT has an associated int value, and this is what is used in comparison operations:
print(pd.to_numeric(pd.Series([pd.NaT])))
0 -9223372036854775808
dtype: int64
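For reference, that sentinel is simply the most negative value an int64 can hold (this check is an addition for illustration, not part of the original answer):

print(np.iinfo(np.int64).min)

-9223372036854775808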
Upvotes: 2
Reputation: 1737
I believe I have this figured out. (Well, at least I figured out one way to skin the cat.) It's not terribly pretty, but it's much faster than my original solution, which embedded all the logic in apply(). Briefly, the solution involves translating the datetime elements to ints, mapping the int version of pd.NaT to np.nan, applying np.minimum/np.maximum, then translating back to datetime64. apply() is still involved, but its logic is much reduced from what I originally had. (No doubt it can still be improved. I'm not much of a Pandas/NumPy guy...)
>>> s1 = pd.Series([pd.NaT, pd.Timestamp(2018, 10, 30), pd.NaT, pd.NaT])
>>> s1
0 NaT
1 2018-10-30
2 NaT
3 NaT
dtype: datetime64[ns]
>>> nanish = int(pd.NaT)
>>> nanish
-9223372036854775808
>>> s2 = pd.to_numeric(s1)
>>> s2
0 -9223372036854775808
1 1540857600000000000
2 -9223372036854775808
3 -9223372036854775808
dtype: int64
>>> s3 = s2.apply(lambda x: np.nan if x == nanish else x)
>>> s3
0 NaN
1 1.540858e+18
2 NaN
3 NaN
dtype: float64
>>> s5 = np.maximum(s3, s4)  # s4: a second series prepared the same way as s3 (its construction was omitted here)
>>> s5
0 NaN
1 1.540858e+18
2 NaN
3 NaN
dtype: float64
>>> s6 = pd.to_datetime(s5)
>>> s6
0 NaT
1 2018-10-30
2 NaT
3 NaT
dtype: datetime64[ns]
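For what it's worth, the same round-trip can be written without apply() at all. Here is a compact sketch (the helper name nat_aware_maximum is my own; it leans on Series.where to map the NaT sentinel to NaN, and on the same pd.to_numeric/pd.to_datetime round-trip shown above):

import numpy as np
import pandas as pd

def nat_aware_maximum(a, b):
    # datetime64 -> int64 nanoseconds, then blank out the NaT positions so
    # they become NaN (the series upcasts to float64); np.maximum then
    # propagates NaN as usual.
    na = pd.to_numeric(a).where(a.notna())
    nb = pd.to_numeric(b).where(b.notna())
    # NaN round-trips back to NaT via to_datetime.
    return pd.to_datetime(np.maximum(na, nb))

With that, nat_aware_maximum(s1, s2) gives the NaT-propagating behaviour asked for in the question, and an analogous nat_aware_minimum follows by swapping in np.minimum.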
Upvotes: 0
Reputation: 29635
Not sure it's the best way, but if you change the type of s1 and s2 to object with astype, then using np.minimum and np.maximum works and you still get a series of dtype datetime64[ns], for example:
print (np.maximum(s1.astype(object), s2.astype(object)))
0 NaT
1 2199-12-31
2 NaT
3 NaT
dtype: datetime64[ns]
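If a particular pandas/NumPy combination hands back an object-dtype result instead (an assumption on my part; the output above already shows datetime64[ns]), it can be coerced back explicitly:

print (pd.to_datetime(np.maximum(s1.astype(object), s2.astype(object))))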
Upvotes: 0