smontanaro
smontanaro

Reputation: 1737

Getting Pandas NaT to propagate like NaN

I'm trying to take the min and max of a couple Pandas Series objects containing datetime64 data in the face of NaT. np.minimum and np.maximum work the way I want if the dtype is float64. That is, once any element in the comparison is NaN, NaN will be the result of that comparison. For example:

>>> s1
0    0.0
1    1.8
2    3.6
3    5.4
dtype: float64
>>> s2
0    10.0
1    17.0
2     NaN
3    14.0
dtype: float64
>>> np.maximum(s1, s2)
0    10.0
1    17.0
2     NaN
3    14.0
dtype: float64
>>> np.minimum(s1, s2)
0    0.0
1    1.8
2    NaN
3    5.4
dtype: float64

This doesn't work if s1 and s2 are datetime64 objects:

>>> s1
0   2199-12-31
1   2199-12-31
2   2199-12-31
3   2199-12-31
dtype: datetime64[ns]
>>> s2
0          NaT
1   2018-10-30 
2          NaT
3          NaT
dtype: datetime64[ns]
>>> np.maximum(s1, s2)
0   2199-12-31
1   2199-12-31
2   2199-12-31
3   2199-12-31
dtype: datetime64[ns]
>>> np.minimum(s1, s2)
0   2199-12-31
1   2018-10-30
2   2199-12-31
3   2199-12-31
dtype: datetime64[ns]

I expected indexes 0, 2 and 3 to turn up as NaT whether computing the min or max. (I realize numpy's functions might not have been the best choice, but I was not successful finding suitable Pandas analogs.)

After doing a bit of reading, I came to realize NaT is only approximately NaN, the latter having a proper floating point representation. Further reading suggested no simple way to have NaT "pollute" these comparisons. What's the correct way to get NaT to propagate in min/max comparisons the way NaN does in a floating point context? Maybe there are Pandas equivalents to numpy.{maximum,minimum} which are NaT-aware?

Upvotes: 5

Views: 1338

Answers (3)

jpp
jpp

Reputation: 164673

pd.Series.mask seems one solution which doesn't give up on vectorisation:

s1 = pd.Series([pd.datetime(2099, 12, 31)]*4)
s2 = pd.Series([pd.NaT, pd.datetime(2018, 10, 30), pd.NaT, pd.NaT])

null_check = s1.isnull() | s2.isnull()
res_max = np.maximum(s1, s2).mask(null_check, np.nan)
res_min = np.minimum(s1, s2).mask(null_check, np.nan)

print(res_max)
print(res_min)

0          NaT
1   2099-12-31
2          NaT
3          NaT
dtype: datetime64[ns]
0          NaT
1   2018-10-30
2          NaT
3          NaT
dtype: datetime64[ns]

As you found, the reason for the behaviour you see is that pd.NaT has an associated int value and this is used for comparison operations:

print(pd.to_numeric(pd.Series([pd.NaT])))

0   -9223372036854775808
dtype: int64

Upvotes: 2

smontanaro
smontanaro

Reputation: 1737

I believe I have this figured out. (Well, at least I figured out one way to skin the cat.) It's not terribly pretty, but it's much faster than my original solution embedding all the logic in apply(). Briefly, the solution involves translating the datetime elements to ints, mapping the int version of pd.NaT to np.nan, applying np.minimum/np.maximum, then translating back to datetime64. apply() is still involved, but the logic is much reduced from what I originally had. (No doubt it can still be improved. I'm not much of a Pandas/NumPy guy...)

>>> s1 = pd.Series([pd.NaT, pd.datetime(2018, 10, 30), pd.NaT, pd.NaT])
>>> s1
0          NaT
1   2018-10-30
2          NaT
3          NaT
dtype: datetime64[ns]
>>> nanish = int(pd.NaT)
>>> nanish
-9223372036854775808
>>> s2 = pd.to_numeric(s1)
>>> s2
0   -9223372036854775808
1    1540857600000000000
2   -9223372036854775808
3   -9223372036854775808
dtype: int64
>>> s3 = s2.apply(lambda x: np.nan if x == nanish else x)
>>> s3
0             NaN
1    1.540858e+18
2             NaN
3             NaN
dtype: float64
>>> s5 = np.maximum(s3, s4)
>>> s5
0             NaN
1    1.540858e+18
2             NaN
3             NaN
dtype: float64
>>> s6 = pd.to_datetime(s5)
>>> s6
0          NaT
1   2018-10-30
2          NaT
3          NaT
dtype: datetime64[ns]

Upvotes: 0

Ben.T
Ben.T

Reputation: 29635

Not sure it's the best way, but if you change the type of s1 and s2 to object with astype, then using np.minimum and np.maximum works and you still get a series of datetime64[ns], for example:

print (np.maximum(s1.astype(object), s2.astype(object)))
0          NaT
1   2199-12-31
2          NaT
3          NaT
Name: 1, dtype: datetime64[ns]

Upvotes: 0

Related Questions