mm441
mm441

Reputation: 495

Search for elements by timestamp in a sorted pandas dataframe

I have a very large pandas dataframe/series with millions of elements. And I need to find all the elements for which timestamp is < than t0. So normally what I would do is:

selected_df = df[df.index < t0]

This takes ages. As I understand when pandas searches it goes through every element of the dataframe. However I know that my dataframe is sorted hence I can break the loop as soon as the timestamp is > t0. I assume pandas doesn't know that dataframe is sorted and searches through all timestamps.

I have tried to use pandas.Series instead - still very slow. I have tried to write my own loop like:

boudery = 0
ticks_time_list = df.index
tsearch = ticks_time_list[0]
while tsearch < t0:
      tsearch = ticks_time_list[boudery]
      boudery += 1      
selected_df = df[:boudery]

This takes even longer than pandas search. The only solution I can see atm is to use Cython. Any ideas how this can be sorted without C involved?

Upvotes: 4

Views: 2707

Answers (2)

DSM
DSM

Reputation: 353369

It doesn't really seem to take ages for me, even with a long frame:

>>> df = pd.DataFrame({"A": 2, "B": 3}, index=pd.date_range("2001-01-01", freq="1 min", periods=10**7))
>>> len(df)
10000000
>>> %timeit df[df.index < "2001-09-01"]
100 loops, best of 3: 18.5 ms per loop

But if we're really trying to squeeze out every drop of performance, we can use the searchsorted method after dropping down to numpy:

>>> %timeit df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))]
10000 loops, best of 3: 51.9 µs per loop
>>> df[df.index < "2001-09-01"].equals(df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))])
True

which is many times faster.

Upvotes: 5

cleros
cleros

Reputation: 4343

(I'm not very familiar with Pandas, but this describes a very generic idea - you should be able to apply it. If necessary, adapt the Pandas-specific functions.) You could try to use a more efficient search. At the moment you are using a linear search, going through all the elements. Instead, try this

ticks_time_list=df.index
tsearch_min = 0
tsearch_max = len(ticks_time_list)-1 #I'm not sure on whether this works on a pandas dataset
while True:
    tsearch_middle = int((tsearch_max-tsearch_min)/2)
    if ticks_time_list[tsearch_middle] < t0:
        tsearch_min = tsearch_middle
    else:
        tsearch_max = tsearch_middle
    if tsearch_max == tsearch_min:
        break
# tsearch_max == tsearch_min and is the value of the index you are looking for

Instead of opening every single element, and looking at the time stamp, it instead tries to find the "boundary" by always narrowing down the search space by cutting it into half.

Upvotes: 0

Related Questions