skyman
skyman

Reputation: 2335

Poor performance filtering one dataframe with another

I have two dataframes one holds unique records of episodic data, the other lists of events. There are multiple events per episode. I need to loop through the episode data, find all the events that correspond to each episode and write the resultant events for a new dataframe. There are around 4,000 episodes and 20,000 events. The process is painfully slow as for each episode I am searching 20,000 events. I am guessing there is a way to reduce the number of events searched each loop by removing the matched ones - but I am not sure. This is my code (there is additional filtering to assist with matching)

    for idx, row in episode_df.iterrows():
        total_episodes += 1
    
        icu_admission = datetime.strptime(row['ICU_ADM'], '%d/%m/%Y %H:%M:%S')
    
        tmp_df = event_df.loc[event_df['ur'] == row['HRN']]
        
        if ( len(tmp_df.index) < 1):
            empty_episodes += 1
            continue
        
        # Loop through temp dataframe and write all records with an admission date
        # close to icu_admission to new dataframe
        for idx_a, row_a in tmp_df.iterrows():
            admission = datetime.strptime(row_a['admission'], '%Y-%m-%d %H:%M:%S')
            difference = admission - icu_admission
            
            if (abs(difference.total_seconds()) > 14400):
                continue
            
            new_df = new_df.append(row_a)
            selected_records += 1

A simplified version of the dataframes:

episode_df:

episode_no     HRN     name      ICU_ADM
1           12345      joe         date1
2           78124      ann         date1
3           98374      bill        date2
4           76523      lucy        date3

event_df

episode_no     ur     admission
1           12345      date1
1           12345      date1
1           12345      date5
7           67899      date9

Not all episodes have events and only events with episodes need to be copied.

Upvotes: 0

Views: 28

Answers (1)

SultanOrazbayev
SultanOrazbayev

Reputation: 16561

This could work:

import pandas as pd
import numpy as np

df1 = pd.DataFrame()
df1['ICU_ADM'] = [pd.to_datetime(f'2020-01-{x}') for x in range(1,10)]
df1['test_day'] = df1['ICU_ADM'].dt.day


df2 = pd.DataFrame()
df2['admission'] = [pd.to_datetime(f'2020-01-{x}') for x in range(2,10,3)]
df2['admission_day'] = df2['admission'].dt.day
df2['random_val'] = np.random.rand(len(df2),1)

pd.merge_asof(df1, df2, left_on=['ICU_ADM'], right_on=['admission'], tolerance=pd.Timedelta('1 day'))

Upvotes: 1

Related Questions