Reputation: 112
I am having some trouble getting my code to run quickly.
After running a line-by-line profiler on my code, I have found that most of the inefficiency comes from the following lines:
import numpy as np
import datetime

# Build a boolean mask of the timestamps that fall in [minTime, maxTime] and count them
timestamps = np.array(timestamps)
mask = (minTime <= timestamps) & (timestamps <= maxTime)
count = np.sum(mask)
timestamps starts out as a list of datetimes, and minTime and maxTime are single datetimes.
Example values for minTime and timestamps:
minTime = datetime.datetime(2020, 5, 21, 2, 27, 26)
timestamps = [datetime.datetime(2020, 5, 21, 2, 27, 26), datetime.datetime(2020, 5, 21, 2, 27, 26),
datetime.datetime(2020, 5, 21, 2, 27, 26), datetime.datetime(2020, 5, 21, 2, 30, 55),
datetime.datetime(2020, 5, 21, 2, 30, 55), datetime.datetime(2020, 5, 21, 2, 30, 55),
datetime.datetime(2020, 5, 21, 2, 34, 26), datetime.datetime(2020, 5, 21, 2, 34, 26),
datetime.datetime(2020, 5, 21, 2, 34, 26), datetime.datetime(2020, 5, 21, 2, 39, 26),
datetime.datetime(2020, 5, 21, 2, 39, 26), datetime.datetime(2020, 5, 21, 2, 39, 26)]
Is there a more efficient way to rewrite the code above?
Any advice is appreciated.
Upvotes: 0
Views: 97
Reputation: 68146
Looks like numpy.datetime64 objects are pretty fast: about a 2x speed-up over standard-library datetime. Pandas kind of flounders here. It does a little better than what you see below if you use the pandas Timestamps as the index of a Series and go through the .loc accessor, but not that much better.
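For reference, here is a minimal sketch of that .loc approach; the names stamps, s, lo, and hi are made up for this illustration, and it assumes the index is sorted:

from datetime import datetime
import pandas

# Put the timestamps into a sorted DatetimeIndex and slice with .loc,
# which is inclusive on both endpoints for label slices.
stamps = pandas.DatetimeIndex([
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
])
s = pandas.Series(range(len(stamps)), index=stamps)

lo = pandas.Timestamp(datetime(2020, 5, 21, 2, 27, 26))
hi = pandas.Timestamp(datetime(2020, 5, 21, 2, 39, 26))
count = len(s.loc[lo:hi])  # 4 for this toy index

The benchmark below uses the simpler mask-and-sum comparison from the question for all three representations: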
from datetime import datetime
import numpy
import pandas
# The question's timestamps as an object array of Python datetimes
py_dts = numpy.array([
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26)
])

# Bounds as plain datetimes, numpy.datetime64, and pandas.Timestamp
min_pydt = datetime(2020, 5, 21, 2, 27, 26)
max_pydt = datetime(2020, 5, 21, 2, 39, 26)
min_npdt = numpy.datetime64(min_pydt)
max_npdt = numpy.datetime64(max_pydt)
min_pddt = pandas.Timestamp(min_pydt)
max_pddt = pandas.Timestamp(max_pydt)

# The same timestamps in each of the other two representations
np_64s = numpy.array([numpy.datetime64(d) for d in py_dts])
pd_tss = pandas.Series([pandas.Timestamp(d) for d in py_dts])

def counter(timestamps, mindt, maxdt):
    # Count elements inside the closed interval [mindt, maxdt]
    return ((mindt <= timestamps) & (timestamps <= maxdt)).sum()
In a Jupyter notebook I did:
%%timeit
counter(py_dts, min_pydt, max_pydt)
17.4 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
counter(np_64s, min_npdt, max_npdt)
7.42 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
counter(pd_tss, min_pddt, max_pddt)
531 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
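Applied to the snippet in the question (using the question's own timestamps, minTime, and maxTime; the microsecond unit '[us]' is just one reasonable choice), that might look like:

import numpy as np

# timestamps, minTime and maxTime as defined in the question.
# Convert the Python datetimes to datetime64 once, then compare in that domain.
ts64 = np.array(timestamps, dtype='datetime64[us]')
count = ((np.datetime64(minTime) <= ts64) & (ts64 <= np.datetime64(maxTime))).sum()

The comparison itself is the same as before; the gain comes from doing it on datetime64 values instead of an object-dtype array of Python datetimes.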
Upvotes: 1