Reputation: 495
I have a very large pandas dataframe/series with millions of elements. And I need to find all the elements for which timestamp is < than t0. So normally what I would do is:
selected_df = df[df.index < t0]
This takes ages. As I understand when pandas searches it goes through every element of the dataframe. However I know that my dataframe is sorted hence I can break the loop as soon as the timestamp is > t0. I assume pandas doesn't know that dataframe is sorted and searches through all timestamps.
I have tried to use pandas.Series instead - still very slow. I have tried to write my own loop like:
boudery = 0
ticks_time_list = df.index
tsearch = ticks_time_list[0]
while tsearch < t0:
tsearch = ticks_time_list[boudery]
boudery += 1
selected_df = df[:boudery]
This takes even longer than pandas search. The only solution I can see atm is to use Cython. Any ideas how this can be sorted without C involved?
Upvotes: 4
Views: 2707
Reputation: 353369
It doesn't really seem to take ages for me, even with a long frame:
>>> df = pd.DataFrame({"A": 2, "B": 3}, index=pd.date_range("2001-01-01", freq="1 min", periods=10**7))
>>> len(df)
10000000
>>> %timeit df[df.index < "2001-09-01"]
100 loops, best of 3: 18.5 ms per loop
But if we're really trying to squeeze out every drop of performance, we can use the searchsorted
method after dropping down to numpy
:
>>> %timeit df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))]
10000 loops, best of 3: 51.9 µs per loop
>>> df[df.index < "2001-09-01"].equals(df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))])
True
which is many times faster.
Upvotes: 5
Reputation: 4343
(I'm not very familiar with Pandas, but this describes a very generic idea - you should be able to apply it. If necessary, adapt the Pandas-specific functions.) You could try to use a more efficient search. At the moment you are using a linear search, going through all the elements. Instead, try this
ticks_time_list=df.index
tsearch_min = 0
tsearch_max = len(ticks_time_list)-1 #I'm not sure on whether this works on a pandas dataset
while True:
tsearch_middle = int((tsearch_max-tsearch_min)/2)
if ticks_time_list[tsearch_middle] < t0:
tsearch_min = tsearch_middle
else:
tsearch_max = tsearch_middle
if tsearch_max == tsearch_min:
break
# tsearch_max == tsearch_min and is the value of the index you are looking for
Instead of opening every single element, and looking at the time stamp, it instead tries to find the "boundary" by always narrowing down the search space by cutting it into half.
Upvotes: 0