user3782604
user3782604

Reputation: 330

Pandas: Join two data frame and apply filter

I have 2 dataframes.

Dataframe 1

Userid | SessionID | Endtime
John   | '' | 0910
Paul   | '' | 0920
.....

Dataframe 2

UserID| SessionID | starttime|end time
John | 0 | 0905 | 0915
Jack | 1 | 0900 | 0915
....

Dataframe 1 has 333975 rows. Dataframe 2 has 2460 rows.

I want to label dataframe 2 with reference to dataframe 1. The match is if user in dateframe 1 = user dataframe 2, and if "endtime" falls between "starttime" and "end time", copy the SessionID from dataframe 1 to dataframe 2.

My code goes like this:

For i in range(len(df1)) :
    For j in range(len(df2)) :
        if(df1['Userid'][1] == df2['UserID']) :
            if((df1['Endtime'] [i] > df2['starttime'][j]) & (df1['Endtime'] [i] < df2['end time'][j])) 
                df1['SessionID' ][i] = df2['SessionID'][j]

Previously when I processed 65k of d1, it takes 30 mins to complete. Now with 333k it takes hours.

Is there a more efficient way to do this kind of labelling?

Update: I have also tried using np.where to do this but it is also taking a long time. It has ran 2 hours and still counting.

Here's my code:

df1['SessionID' ][i] = np.where( (df1['Userid'][1] == df2['UserID']) &  (df1['Endtime'] [i] > df2['starttime'][j]) & (df1['Endtime'] [i] < df2['end time'][j]), df2['SessionID'][j], df1['SessionID' ][i]) 

Upvotes: 1

Views: 695

Answers (2)

user2077935
user2077935

Reputation: 388

You can merge two data frames and apply a filter on top of it.

raw_data = {
    'user_id': ['John', 'Paul'],
    'session_id': [1, 2],
    'end_time' : [910, 920]
}
pd_a = pd.DataFrame(
    raw_data, columns=['user_id', 'session_id', 'end_time']
)

raw_data = {
    'user_id': ['John', 'Paul'],
    'session_id': [1, 2],
    'start_time': [900, 900],
    'end_time' : [915, 925]
}
pd_b = pd.DataFrame(
    raw_data, columns=['user_id', 'session_id', 'start_time', 'end_time']
)

final_pd = pd.merge(pd_a, pd_b, on='user_id')

Output

  user_id  session_id_x  end_time_x  session_id_y  start_time  end_time_y
0    John             1         910             1         900         915
1    Paul             2         920             2         900         925

then, finally apply any filter you want to.

final_pd[final_pd['end_time_x']<=final_pd['end_time_y']]

Upvotes: 4

Emanuele Bellucci
Emanuele Bellucci

Reputation: 58

You can try managing the objects of the second 'if' statement as Pandas Series or lists and then if the condition is satisfied you can perform the labelling on thedataset.

Upvotes: 0

Related Questions