Emaad Shamsi

Reputation: 112

What is more efficient than np.sum and numpy boolean operators?

I am having some trouble getting my code to run quickly.

After running a line-by-line profiler on my code, I found that the following lines account for most of the run time:

import numpy as np
import datetime

timestamps = np.array(timestamps)
mask = (minTime <= timestamps) & (timestamps <= maxTime)
count = np.sum(mask)

timestamps starts out as a list of datetimes, and minTime and maxTime are each a single datetime.

An example value for timestamps:

minTime = datetime.datetime(2020, 5, 21, 2, 27, 26)

timestamps = [datetime.datetime(2020, 5, 21, 2, 27, 26), datetime.datetime(2020, 5, 21, 2, 27, 26), 
 datetime.datetime(2020, 5, 21, 2, 27, 26), datetime.datetime(2020, 5, 21, 2, 30, 55),
 datetime.datetime(2020, 5, 21, 2, 30, 55), datetime.datetime(2020, 5, 21, 2, 30, 55),
 datetime.datetime(2020, 5, 21, 2, 34, 26), datetime.datetime(2020, 5, 21, 2, 34, 26),
 datetime.datetime(2020, 5, 21, 2, 34, 26), datetime.datetime(2020, 5, 21, 2, 39, 26),
 datetime.datetime(2020, 5, 21, 2, 39, 26), datetime.datetime(2020, 5, 21, 2, 39, 26)]

Is there a more efficient way to rewrite the code above?

Any advice is appreciated.

Upvotes: 0

Views: 97

Answers (1)

Paul H

Reputation: 68146

Looks like numpy.datetime64 objects are pretty fast: about a 2x speedup over standard-library datetime objects on this example. Pandas kind of flounders here. It does a little better than what you see below if you use the pandas Timestamps as the index of a Series object and use the .loc accessor, but not by much.

from datetime import datetime

import numpy
import pandas


py_dts = numpy.array([
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 27, 26), 
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26)
])

min_pydt = datetime(2020, 5, 21, 2, 27, 26)
max_pydt = datetime(2020, 5, 21, 2, 39, 26)

min_npdt = numpy.datetime64(min_pydt)
max_npdt = numpy.datetime64(max_pydt)

min_pddt = pandas.Timestamp(min_pydt)
max_pddt = pandas.Timestamp(max_pydt)

np_64s = numpy.array([numpy.datetime64(d) for d in py_dts])
pd_tss = pandas.Series([pandas.Timestamp(d) for d in py_dts])


def counter(timestamps, mindt, maxdt):    
    return ((mindt <= timestamps) & (timestamps <= maxdt)).sum()

In a Jupyter notebook I ran:

%%timeit
counter(py_dts, min_pydt, max_pydt)

17.4 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
counter(np_64s, min_npdt, max_npdt)

7.42 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
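
If you start from a plain list of datetimes, as in the question, one way to get a native datetime64 array is to convert once up front and then count on that array. The sketch below is only an illustration: the sample data, the names `count_in_range` and `count_in_range_sorted`, and the choice of microsecond resolution are my assumptions, not something from the question. The binary-search variant additionally assumes the timestamps are already sorted. I have not timed either variant here, so measure on your own data.

```python
from datetime import datetime

import numpy as np

# Hypothetical sample data mirroring the question's list of datetimes.
timestamps = [
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
]
min_time = datetime(2020, 5, 21, 2, 30, 0)
max_time = datetime(2020, 5, 21, 2, 39, 26)

# Convert once to a native datetime64 array instead of an object array.
# Microsecond resolution matches the resolution of datetime.datetime.
ts64 = np.array(timestamps, dtype="datetime64[us]")
lo = np.datetime64(min_time, "us")
hi = np.datetime64(max_time, "us")


def count_in_range(ts, lo, hi):
    """Count elements with lo <= ts <= hi using a boolean mask."""
    return int(np.count_nonzero((lo <= ts) & (ts <= hi)))


def count_in_range_sorted(ts_sorted, lo, hi):
    """Same count via binary search; assumes ts_sorted is ascending."""
    left = np.searchsorted(ts_sorted, lo, side="left")
    right = np.searchsorted(ts_sorted, hi, side="right")
    return int(right - left)


count = count_in_range(ts64, lo, hi)
count_sorted = count_in_range_sorted(ts64, lo, hi)
```

Both functions count with inclusive bounds, matching the `<=` comparisons in the question.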

%%timeit
counter(pd_tss, min_pddt, max_pddt)

531 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
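
For reference, the Series-index variant I mentioned at the top could look roughly like the sketch below. The sample data and the dummy values of 1 are placeholders of mine. It relies on label slicing with .loc being inclusive at both ends, and it sorts the index first so that slicing works with duplicate timestamps.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data, including a duplicated timestamp.
py_dts = [
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
]

# Timestamps go in the index; the values are just a dummy column of ones.
series = pd.Series(1, index=pd.DatetimeIndex(py_dts)).sort_index()

min_ts = pd.Timestamp(2020, 5, 21, 2, 30, 0)
max_ts = pd.Timestamp(2020, 5, 21, 2, 34, 26)

# Label-based slicing on a sorted DatetimeIndex includes both endpoints.
loc_count = len(series.loc[min_ts:max_ts])
```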

Upvotes: 1
