Reputation: 9078
I have a numpy array of strings (P.S. why is a string represented as object?!)
t = array(['21/02/2014 08:40:00 AM', '11/02/2014 10:50:00 PM',
           '07/04/2014 05:50:00 PM', '17/02/2014 10:20:00 PM',
           '07/03/2014 06:10:00 AM', '02/03/2014 12:25:00 PM',
           '05/02/2014 03:20:00 AM', '31/01/2014 12:30:00 AM',
           '28/02/2014 01:25:00 PM'], dtype=object)
I would like to convert it to numpy.datetime64 with day resolution; however, the only solution I have found is:
t = [datetime.strptime(tt,"%d/%m/%Y %H:%M:%S %p") for tt in t]
t = np.array(t,dtype='datetime64[us]').astype('datetime64[D]')
Can it get uglier than that? Why do I need to go through a native Python list? There must be another way...
By the way, I cannot find a way to plot a histogram of dates in numpy/pandas.
Upvotes: 3
Views: 5680
Reputation: 180532
The date format is the problem: a string like 01/02/2015 is ambiguous (day-first or month-first?). If it were in ISO 8601 format, you could parse it directly using numpy. In your case, since you only want the date, splitting each string and working only with the date part will be significantly faster:
from datetime import datetime
import numpy as np
t = np.array([datetime.strptime(d.split(None)[0], "%d/%m/%Y")
              for d in t], dtype='datetime64[us]').astype('datetime64[D]')
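As a rough illustration of the ISO 8601 point, here is a minimal sketch that reorders the date part into year-month-day and lets numpy parse the resulting strings itself. The helper name to_iso_date and the shortened sample array are made up for this example; it assumes every string starts with a dd/mm/yyyy date followed by whitespace.

```python
import numpy as np

# Shortened sample in the same dd/mm/yyyy format as the question.
t = np.array(['21/02/2014 08:40:00 AM', '11/02/2014 10:50:00 PM',
              '31/01/2014 12:30:00 AM'], dtype=object)

def to_iso_date(s):
    """Hypothetical helper: 'dd/mm/yyyy hh:mm:ss AM' -> 'yyyy-mm-dd'."""
    # Keep only the date part, then reorder it into ISO 8601 order.
    day, month, year = s.split(None, 1)[0].split("/")
    return "{}-{}-{}".format(year, month, day)

# numpy parses ISO 8601 date strings directly, so no datetime objects
# are needed; asking for datetime64[D] gives day resolution.
days = np.array([to_iso_date(s) for s in t], dtype='datetime64[D]')
print(days.dtype, days)
```

This still loops in Python to rebuild the strings, but it skips strptime entirely, which is where most of the time goes.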
Some timings. First, splitting each string and rearranging the date parts into ISO 8601 order before passing it to np.datetime64:
In [36]: %%timeit
from datetime import datetime
t = np.array(['21/02/2014 08:40:00 AM', '11/02/2014 10:50:00 PM',
              '07/04/2014 05:50:00 PM', '17/02/2014 10:20:00 PM',
              '07/03/2014 06:10:00 AM', '02/03/2014 12:25:00 PM',
              '05/02/2014 03:20:00 AM', '31/01/2014 12:30:00 AM',
              '28/02/2014 01:25:00 PM']*10000)
t1 = np.array([np.datetime64("{}-{}-{}".format(c[:4], b, a))
               for a, b, c in (s.split("/", 2) for s in t)])
   ....:
10 loops, best of 3: 125 ms per loop
Your code:
In [37]: %%timeit
from datetime import datetime
t = np.array(['21/02/2014 08:40:00 AM', '11/02/2014 10:50:00 PM',
              '07/04/2014 05:50:00 PM', '17/02/2014 10:20:00 PM',
              '07/03/2014 06:10:00 AM', '02/03/2014 12:25:00 PM',
              '05/02/2014 03:20:00 AM', '31/01/2014 12:30:00 AM',
              '28/02/2014 01:25:00 PM']*10000)
t = [datetime.strptime(tt, "%d/%m/%Y %H:%M:%S %p") for tt in t]
t = np.array(t, dtype='datetime64[us]').astype('datetime64[D]')
   ....:
1 loops, best of 3: 1.56 s per loop
A dramatic difference (roughly 12 times faster here), with both approaches giving the same result:
In [48]: t = np.array(['21/02/2014 08:40:00 AM', '11/02/2014 10:50:00 PM',
                       '07/04/2014 05:50:00 PM', '17/02/2014 10:20:00 PM',
                       '07/03/2014 06:10:00 AM', '02/03/2014 12:25:00 PM',
                       '05/02/2014 03:20:00 AM', '31/01/2014 12:30:00 AM',
                       '28/02/2014 01:25:00 PM'] * 10000)
In [49]: t1 = [datetime.strptime(tt, "%d/%m/%Y %H:%M:%S %p") for tt in t]
         t1 = np.array(t1, dtype='datetime64[us]').astype('datetime64[D]')
   ....:
In [50]: t2 = np.array([np.datetime64("{}-{}-{}".format(c[:4], b, a))
                        for a, b, c in (s.split("/", 2) for s in t)])
In [51]: (t1 == t2).all()
Out[51]: True
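One side note on the strptime format used above: according to the Python documentation, %p only affects the parsed hour when the hour is read with %I (12-hour clock). With %H the AM/PM marker is consumed but ignored. That does not change anything at day resolution, which is why the comparison above still holds, but it matters as soon as you keep the time of day. A small sketch using one of the sample strings:

```python
from datetime import datetime

s = '11/02/2014 10:50:00 PM'

# %H reads the hour as 24-hour, so the trailing PM is parsed but ignored.
with_H = datetime.strptime(s, "%d/%m/%Y %H:%M:%S %p")
# %I reads the hour as 12-hour, so PM shifts it to the afternoon.
with_I = datetime.strptime(s, "%d/%m/%Y %I:%M:%S %p")

print(with_H.hour)  # 10
print(with_I.hour)  # 22
print(with_H.date() == with_I.date())  # True: the day is the same either way
```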
Upvotes: 1