Reputation: 17794
I have data with time intervals:
Start End
0 2022-01-01 00:00 2022-01-03 00:00
1 2022-02-01 00:00 2022-03-01 03:00
2 2022-04-01 11:00 2022-04-10 13:00
3 2022-08-01 12:00 2022-08-07 17:00
Having a list of time points, I select intervals that contain a certain Timestamp using boolean indexing:
result = []
for t in ['2022-01-02 00:00', '2022-04-01 11:00', '2022-08-01 12:00']:
t = pd.Timestamp(t)
df_sel = df[df['Start'].le(t) & df['End'].gt(t)]
result.append(df_sel)
Resulting data frames have different lengths. What is the most efficient way to do boolean indexing with time data? Is it better to use NumPy or some other dtypes? How can I speed up my solution?
Upvotes: 2
Views: 121
Reputation: 50278
Is it better to use NumPy or some other dtypes?
What is the most efficient way to do boolean indexing with time data?
Pandas uses Numpy internally for most functions. Numpy is not optimal but pretty good for this.
The thing is the solution is inefficient because it iterate over the whole dataframe for each t
value in the loop. The resulting complexity is O(m n)
where m
is len(df)
and n
is the number of timestamps. For n=3 values, this solution can hardly be optimized using Pandas/Numpy but it can be made a bit faster with Numba/Cython which are a bit more low-level. I assume you have more items in practice.
How can I speed up my solution?
This kind of problem can be solve using an interval tree. The idea is to add the start-end indices to the interval tree and then query the tree with each searched timestamp. Regarding the exact interval tree data structure used, you can store the indices so to find out all the rows. Note that the complexity is O(n log m + k)
where k
is the number of row found. If there is too many values in output, then there is no fast solution because Pandas indexing is in O(k)
anyway. Note that such data structure is not provided by standard Python module but there are some external module that appear to do that like this one.
If n is huge, an alternative solution is to sort the timestamps, put them in a Numpy array, and do a binary search (np.searchsorted
) so to know for each row the timestamps that are included. The binary search can be vectorized so to search many values at a time. The appending can hardly be vectorized though. This solution runs in O(m log n)
.
Upvotes: 1