Reputation: 3
I have two pandas DataFrames:
I want to use the second Dataframe to extract all sequences out of the first (all rows of the first between the two timestamps for each row of 2. ), then each sequence needs to be transposed into 990 columns and then all sequences have to be combined in a new DataFrame.
So the new DataFrame has one row with 990 columns for each sequence IMG (case row get added later).
Right now my code looks like this:
sequences = pd.DataFrame()
for row in df_seq.itertuples(index=True, name='Pandas'):
sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())
sequences = sequences.reset_index(drop=True)
This code works, but is terribly slow --> 20-25 min execution time
Is there a way to rewrite this in vectorised operations? Or any other way to improve the performance of this code?
Upvotes: 0
Views: 668
Reputation: 3
After the steps from the answer above I added a groupby and a unstack and the result is exactly the df i need:
Execution time is ~30 seconds!
The full code looks now like this:
sequences = pd.merge_asof(df, df_seq[["date"]], left_on="timestamp", right_on="date", )
sequences = pd.merge_asof(sequences, df_seq[["end_date"]], left_on="timestamp", right_on="end_date", direction="forward")
sequences = sequences[(sequences.timestamp >= sequences.date) & (sequences.timestamp <= sequences.end_date)]
sequences = sequences.groupby('date')['feature_1'].apply(lambda df_temp: df_temp.reset_index(drop=True)).unstack().loc[:,:990]
sequences = sequences.reset_index(drop=True)
Upvotes: 0
Reputation: 12503
Here's a way to do it. The large dataframe is 'df', and the intervals one is called 'intervals':
inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"
intervals = pd.DataFrame([("2020-01-01 00:00:12","2020-01-01 00:00:18"),
("2020-01-01 00:01:20","2020-01-01 00:02:03")],
columns=["start_time", "end_time"])
intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals
t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time", )
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")
t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]
The result is:
timestamp 0 start_time end_time
12 2020-01-01 00:00:12 12 2020-01-01 00:00:12 2020-01-01 00:00:18
13 2020-01-01 00:00:13 13 2020-01-01 00:00:12 2020-01-01 00:00:18
14 2020-01-01 00:00:14 14 2020-01-01 00:00:12 2020-01-01 00:00:18
15 2020-01-01 00:00:15 15 2020-01-01 00:00:12 2020-01-01 00:00:18
16 2020-01-01 00:00:16 16 2020-01-01 00:00:12 2020-01-01 00:00:18
.. ... ... ... ...
119 2020-01-01 00:01:59 119 2020-01-01 00:01:20 2020-01-01 00:02:03
120 2020-01-01 00:02:00 120 2020-01-01 00:01:20 2020-01-01 00:02:03
121 2020-01-01 00:02:01 121 2020-01-01 00:01:20 2020-01-01 00:02:03
122 2020-01-01 00:02:02 122 2020-01-01 00:01:20 2020-01-01 00:02:03
123 2020-01-01 00:02:03 123 2020-01-01 00:01:20 2020-01-01 00:02:03
Upvotes: 1