Jona Wössner
Jona Wössner

Reputation: 3

Python/Pandas: extract intervals from a large dataframe

I have two pandas DataFrames:

  1. 20 million rows of continues time series data with DateTime Index (df) IMG
  2. 20 thousand rows with two timestamps (df_seq) IMG

I want to use the second Dataframe to extract all sequences out of the first (all rows of the first between the two timestamps for each row of 2. ), then each sequence needs to be transposed into 990 columns and then all sequences have to be combined in a new DataFrame.

So the new DataFrame has one row with 990 columns for each sequence IMG (case row get added later).

Right now my code looks like this:

sequences = pd.DataFrame()

for row in df_seq.itertuples(index=True, name='Pandas'):
    sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())

sequences = sequences.reset_index(drop=True)

This code works, but is terribly slow --> 20-25 min execution time

Is there a way to rewrite this in vectorised operations? Or any other way to improve the performance of this code?

Upvotes: 0

Views: 668

Answers (2)

Jona Wössner
Jona Wössner

Reputation: 3

After the steps from the answer above I added a groupby and a unstack and the result is exactly the df i need:

Execution time is ~30 seconds!

The full code looks now like this:

sequences = pd.merge_asof(df, df_seq[["date"]], left_on="timestamp", right_on="date", )
sequences = pd.merge_asof(sequences, df_seq[["end_date"]], left_on="timestamp", right_on="end_date", direction="forward")
sequences = sequences[(sequences.timestamp >= sequences.date) & (sequences.timestamp <= sequences.end_date)]

sequences = sequences.groupby('date')['feature_1'].apply(lambda df_temp: df_temp.reset_index(drop=True)).unstack().loc[:,:990]
sequences = sequences.reset_index(drop=True)

Upvotes: 0

Roy2012
Roy2012

Reputation: 12503

Here's a way to do it. The large dataframe is 'df', and the intervals one is called 'intervals':

inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"

intervals = pd.DataFrame([("2020-01-01 00:00:12","2020-01-01 00:00:18"), 
                   ("2020-01-01 00:01:20","2020-01-01 00:02:03")], 
                  columns=["start_time", "end_time"])

intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals

t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time", )
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")

t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]

The result is:

              timestamp    0          start_time            end_time
12  2020-01-01 00:00:12   12 2020-01-01 00:00:12 2020-01-01 00:00:18
13  2020-01-01 00:00:13   13 2020-01-01 00:00:12 2020-01-01 00:00:18
14  2020-01-01 00:00:14   14 2020-01-01 00:00:12 2020-01-01 00:00:18
15  2020-01-01 00:00:15   15 2020-01-01 00:00:12 2020-01-01 00:00:18
16  2020-01-01 00:00:16   16 2020-01-01 00:00:12 2020-01-01 00:00:18
..                  ...  ...                 ...                 ...
119 2020-01-01 00:01:59  119 2020-01-01 00:01:20 2020-01-01 00:02:03
120 2020-01-01 00:02:00  120 2020-01-01 00:01:20 2020-01-01 00:02:03
121 2020-01-01 00:02:01  121 2020-01-01 00:01:20 2020-01-01 00:02:03
122 2020-01-01 00:02:02  122 2020-01-01 00:01:20 2020-01-01 00:02:03
123 2020-01-01 00:02:03  123 2020-01-01 00:01:20 2020-01-01 00:02:03

Upvotes: 1

Related Questions