Reputation: 2335
I have two dataframes one holds unique records of episodic data, the other lists of events. There are multiple events per episode. I need to loop through the episode data, find all the events that correspond to each episode and write the resultant events for a new dataframe. There are around 4,000 episodes and 20,000 events. The process is painfully slow as for each episode I am searching 20,000 events. I am guessing there is a way to reduce the number of events searched each loop by removing the matched ones - but I am not sure. This is my code (there is additional filtering to assist with matching)
for idx, row in episode_df.iterrows():
total_episodes += 1
icu_admission = datetime.strptime(row['ICU_ADM'], '%d/%m/%Y %H:%M:%S')
tmp_df = event_df.loc[event_df['ur'] == row['HRN']]
if ( len(tmp_df.index) < 1):
empty_episodes += 1
continue
# Loop through temp dataframe and write all records with an admission date
# close to icu_admission to new dataframe
for idx_a, row_a in tmp_df.iterrows():
admission = datetime.strptime(row_a['admission'], '%Y-%m-%d %H:%M:%S')
difference = admission - icu_admission
if (abs(difference.total_seconds()) > 14400):
continue
new_df = new_df.append(row_a)
selected_records += 1
A simplified version of the dataframes:
episode_df:
episode_no HRN name ICU_ADM
1 12345 joe date1
2 78124 ann date1
3 98374 bill date2
4 76523 lucy date3
event_df
episode_no ur admission
1 12345 date1
1 12345 date1
1 12345 date5
7 67899 date9
Not all episodes have events and only events with episodes need to be copied.
Upvotes: 0
Views: 28
Reputation: 16561
This could work:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['ICU_ADM'] = [pd.to_datetime(f'2020-01-{x}') for x in range(1,10)]
df1['test_day'] = df1['ICU_ADM'].dt.day
df2 = pd.DataFrame()
df2['admission'] = [pd.to_datetime(f'2020-01-{x}') for x in range(2,10,3)]
df2['admission_day'] = df2['admission'].dt.day
df2['random_val'] = np.random.rand(len(df2),1)
pd.merge_asof(df1, df2, left_on=['ICU_ADM'], right_on=['admission'], tolerance=pd.Timedelta('1 day'))
Upvotes: 1