Reputation: 330
I have 2 dataframes.
Dataframe 1
Userid | SessionID | Endtime
John | '' | 0910
Paul | '' | 0920
.....
Dataframe 2
UserID| SessionID | starttime|end time
John | 0 | 0905 | 0915
Jack | 1 | 0900 | 0915
....
Dataframe 1 has 333975 rows. Dataframe 2 has 2460 rows.
I want to label dataframe 2 with reference to dataframe 1. The match is if user in dateframe 1 = user dataframe 2, and if "endtime" falls between "starttime" and "end time", copy the SessionID from dataframe 1 to dataframe 2.
My code goes like this:
For i in range(len(df1)) :
For j in range(len(df2)) :
if(df1['Userid'][1] == df2['UserID']) :
if((df1['Endtime'] [i] > df2['starttime'][j]) & (df1['Endtime'] [i] < df2['end time'][j]))
df1['SessionID' ][i] = df2['SessionID'][j]
Previously when I processed 65k of d1, it takes 30 mins to complete. Now with 333k it takes hours.
Is there a more efficient way to do this kind of labelling?
Update: I have also tried using np.where to do this but it is also taking a long time. It has ran 2 hours and still counting.
Here's my code:
df1['SessionID' ][i] = np.where( (df1['Userid'][1] == df2['UserID']) & (df1['Endtime'] [i] > df2['starttime'][j]) & (df1['Endtime'] [i] < df2['end time'][j]), df2['SessionID'][j], df1['SessionID' ][i])
Upvotes: 1
Views: 695
Reputation: 388
You can merge two data frames and apply a filter on top of it.
raw_data = {
'user_id': ['John', 'Paul'],
'session_id': [1, 2],
'end_time' : [910, 920]
}
pd_a = pd.DataFrame(
raw_data, columns=['user_id', 'session_id', 'end_time']
)
raw_data = {
'user_id': ['John', 'Paul'],
'session_id': [1, 2],
'start_time': [900, 900],
'end_time' : [915, 925]
}
pd_b = pd.DataFrame(
raw_data, columns=['user_id', 'session_id', 'start_time', 'end_time']
)
final_pd = pd.merge(pd_a, pd_b, on='user_id')
Output
user_id session_id_x end_time_x session_id_y start_time end_time_y
0 John 1 910 1 900 915
1 Paul 2 920 2 900 925
then, finally apply any filter you want to.
final_pd[final_pd['end_time_x']<=final_pd['end_time_y']]
Upvotes: 4
Reputation: 58
You can try managing the objects of the second 'if' statement as Pandas Series or lists and then if the condition is satisfied you can perform the labelling on thedataset.
Upvotes: 0