Federico Caccia
Federico Caccia

Reputation: 1897

how to improve searching index in dataframe

Given a pandas dataframe with a timestamp index, sorted. I have a label and I need to find the closest index to that label. Also, I need to find a smaller timestamp, so the search should be computed in the minor timestamps. Here is my code:

import pandas as pd
import datetime

data = [i for i in range(100)]
dates = pd.date_range(start="01-01-2018", freq="min", periods=100)
dataframe = pd.DataFrame(data, dates)

label = "01-01-2018 00:10:01"
method = "pad"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)

print("Closest idx:"+str(idx))
print("Closest date:"+str(dataframe.index[idx]))

the searching is too slow. Is there a way to improve it?

Upvotes: 2

Views: 618

Answers (1)

user3483203
user3483203

Reputation: 51185

To improve performance, I recommend a transformation of what you're searching. Instead of using get_loc, you can convert your DateTimeIndex to Unix Time, and use np.searchsorted on the underlying numpy array (As the name implies, this requires a sorted index).


get_loc:

(Your current approach)

label = "01-01-2018 00:10:01"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
print(dataframe.iloc[idx])

0    10
Name: 2018-01-01 00:10:00, dtype: int64

And it's timings:

%timeit dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
2.03 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.searchsorted:

arr = df.index.astype(int)//10**9
l = pd.to_datetime(label).timestamp()
idx = np.max(np.searchsorted(arr, l, side='left')-1, 0)
print(dataframe.iloc[idx])

0    10
Name: 2018-01-01 00:10:00, dtype: int64

And the timings:

%timeit np.max(np.searchsorted(arr, l, side='left')-1, 0)
56.6 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(I didn't include the setup costs, because the initial array creation should be something you do once, then use for every single query, but even if I did include the setup costs, this method is faster):

%%timeit
arr = df.index.astype(int)//10**9
l = pd.to_datetime(label).timestamp()
np.max(np.searchsorted(arr, l, side='left')-1, 0)

394 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The above method does not enforce a tolerance of 60s, although this is trivial to check:

>>> np.abs(arr[idx]-l)<60
True

Upvotes: 2

Related Questions