Search for elements by timestamp in a sorted pandas dataframe

Question

I have a very large pandas dataframe/series with millions of elements, and I need to find all the elements whose timestamp is less than t0. Normally I would do:

selected_df = df[df.index < t0]

This takes ages. As I understand it, pandas checks every element of the dataframe when it searches. However, I know that my dataframe is sorted, so the loop could break as soon as a timestamp is >= t0. I assume pandas doesn't know that the dataframe is sorted and so compares all the timestamps.

I have tried using a pandas.Series instead, but it is still very slow. I have also tried writing my own loop:

boundary = 0
ticks_time_list = df.index
# Advance to the first timestamp that is not < t0 (or to the end of the index)
while boundary < len(ticks_time_list) and ticks_time_list[boundary] < t0:
    boundary += 1
selected_df = df[:boundary]

This takes even longer than the pandas search. The only solution I can see at the moment is to use Cython. Any ideas how this can be solved without involving C?

DSM · Accepted Answer

It doesn't really seem to take ages for me, even with a long frame:

>>> df = pd.DataFrame({"A": 2, "B": 3}, index=pd.date_range("2001-01-01", freq="1 min", periods=10**7))
>>> len(df)
10000000
>>> %timeit df[df.index < "2001-09-01"]
100 loops, best of 3: 18.5 ms per loop

But if we're really trying to squeeze out every drop of performance, we can use the searchsorted method after dropping down to numpy:

>>> %timeit df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))]
10000 loops, best of 3: 51.9 µs per loop
>>> df[df.index < "2001-09-01"].equals(df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))])
True

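which is roughly 350 times faster (51.9 µs versus 18.5 ms), because searchsorted does a binary search on the sorted index instead of comparing every timestamp.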
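For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the name rows_before is made up for illustration, and it assumes the index is a DatetimeIndex sorted in ascending order (searchsorted gives meaningless positions otherwise, so the helper checks first). It calls searchsorted on the pandas index directly rather than on the underlying numpy array, which should give the same position.

```python
import pandas as pd


def rows_before(df, t0):
    """Return the rows whose index is strictly less than t0.

    Hypothetical helper: assumes df.index is sorted in ascending order.
    """
    if not df.index.is_monotonic_increasing:
        raise ValueError("index must be sorted in ascending order")
    # side="left" gives the position of the first timestamp >= t0,
    # so everything before that position is < t0.
    cut = df.index.searchsorted(pd.Timestamp(t0), side="left")
    return df.iloc[:cut]


df = pd.DataFrame(
    {"A": 2, "B": 3},
    index=pd.date_range("2001-01-01", freq="min", periods=10**5),
)
selected = rows_before(df, "2001-01-02")
```

Passing side="right" instead would also keep rows whose timestamp equals t0, i.e. the equivalent of df.index <= t0.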
