Python/Pandas: extract intervals from a large dataframe

Question

I have two pandas DataFrames:

20 million rows of continues time series data with DateTime Index (df) IMG
20 thousand rows with two timestamps (df_seq) IMG

I want to use the second Dataframe to extract all sequences out of the first (all rows of the first between the two timestamps for each row of 2. ), then each sequence needs to be transposed into 990 columns and then all sequences have to be combined in a new DataFrame.

So the new DataFrame has one row with 990 columns for each sequence IMG (case row get added later).

Right now my code looks like this:

sequences = pd.DataFrame()

for row in df_seq.itertuples(index=True, name='Pandas'):
    sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())

sequences = sequences.reset_index(drop=True)

This code works, but is terribly slow --> 20-25 min execution time

Is there a way to rewrite this in vectorised operations? Or any other way to improve the performance of this code?

Roy2012 · Accepted Answer

Here's a way to do it. The large dataframe is 'df', and the intervals one is called 'intervals':

inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"

intervals = pd.DataFrame([("2020-01-01 00:00:12","2020-01-01 00:00:18"), 
                   ("2020-01-01 00:01:20","2020-01-01 00:02:03")], 
                  columns=["start_time", "end_time"])

intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals

t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time", )
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")

t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]

The result is:

              timestamp    0          start_time            end_time
12  2020-01-01 00:00:12   12 2020-01-01 00:00:12 2020-01-01 00:00:18
13  2020-01-01 00:00:13   13 2020-01-01 00:00:12 2020-01-01 00:00:18
14  2020-01-01 00:00:14   14 2020-01-01 00:00:12 2020-01-01 00:00:18
15  2020-01-01 00:00:15   15 2020-01-01 00:00:12 2020-01-01 00:00:18
16  2020-01-01 00:00:16   16 2020-01-01 00:00:12 2020-01-01 00:00:18
..                  ...  ...                 ...                 ...
119 2020-01-01 00:01:59  119 2020-01-01 00:01:20 2020-01-01 00:02:03
120 2020-01-01 00:02:00  120 2020-01-01 00:01:20 2020-01-01 00:02:03
121 2020-01-01 00:02:01  121 2020-01-01 00:01:20 2020-01-01 00:02:03
122 2020-01-01 00:02:02  122 2020-01-01 00:01:20 2020-01-01 00:02:03
123 2020-01-01 00:02:03  123 2020-01-01 00:01:20 2020-01-01 00:02:03

Python/Pandas: extract intervals from a large dataframe

Answers (2)

Related Questions