Optimising Pandas Function That Compares DataFrames

Question

I've got transaction logs that record usage for a kiosk machine, and another set of logs for machine online/offline times. The transaction logs contain a datetime field that tells you when the transaction (or session) occurred.

   event_date           raw_data1                                    session_id                            ws_id
0  2017-11-06 12:13:06  {'description': 'Home'}                      0604e80d-1ae6-48d0-81bf-32ca1dc58e4c  machine2
1  2017-11-06 12:13:41  {'description': 'AreYouStillThere'}          0604e80d-1ae6-48d0-81bf-32ca1dc58e4c  machine2
2  2017-11-06 12:14:09  {'description': 'AttractiveAnimation'}       0604e80d-1ae6-48d0-81bf-32ca1dc58e4c  machine2
3  2017-11-07 10:06:15  {'description': 'Home'}                      e2e7565f-60b4-4e7b-a8f0-d0a9c384b283  machine13
4  2017-11-07 10:06:27  {'description': 'AuthenticationPanelAdmin'}  e2e7565f-60b4-4e7b-a8f0-d0a9c384b283  machine13

The goal of this function is to see which session_ids coincide with an offline log:

   dtrange                                            start                end                  status  machine_id
0  DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:46:15  2017-11-17 15:01:15  2       12
1  DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:47:02  2017-11-17 15:02:02  2       22
2  DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:47:23  2017-11-17 15:02:23  2       18
3  DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:48:09  2017-11-17 15:03:09  2       17
4  DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:49:18  2017-11-17 15:04:18  2       15

ws_id and machine_id identify the same machine, which makes this a little trickier: both the session time and the machine must match across the two dataframes.

This is the code I'm using to return all session_ids that occurred while a machine was offline. It filters the offline dataframe once for each row of the transaction dataframe and returns the session_id if an offline event coincided with the session time:

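def CheckSession(machinename, sessiontime, sessionid):
    # Offline windows for this machine that contain the session time
    matches = offlinedf[(offlinedf.start < sessiontime)
                        & (offlinedf.end > sessiontime)
                        & (offlinedf.name == machinename)]
    if len(matches) > 0:
        return sessionid

# Runs one filter over offlinedf per transaction row, which is what makes it slow
sessions = df.apply(lambda row: CheckSession(row["name"], row["created_at1"], row["session_id"]), axis=1)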

This builds the list of sessions, but it is very slow and the dataframes are quite large. I'm still learning how best to work with the pandas library - I was hoping to optimise it using some vectorization but haven't been able to work out how to build it that way.
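
For reference, the kind of vectorised shape I have in mind is roughly the sketch below. It is only an illustration, not something I have confirmed on the real data: it assumes both dataframes share a "name" column for the machine, that the session timestamp is in "created_at1", and that the offline window is bounded by "start" and "end" (the column names used in my code above). The function name sessions_during_offline and the sample rows are made up for the example.

```python
import pandas as pd

def sessions_during_offline(df, offlinedf):
    """Return the unique session_ids whose timestamp falls inside an
    offline window recorded for the same machine."""
    # Pair every transaction row with every offline window of the same machine.
    merged = df.merge(offlinedf[["name", "start", "end"]], on="name", how="inner")
    # Keep only the pairs where the session time lies strictly inside the window.
    inside = (merged["start"] < merged["created_at1"]) & (merged["created_at1"] < merged["end"])
    return merged.loc[inside, "session_id"].drop_duplicates().reset_index(drop=True)

# Hypothetical sample data, column names assumed as described above.
df = pd.DataFrame({
    "name": ["machine2", "machine2", "machine13"],
    "created_at1": pd.to_datetime(
        ["2017-11-06 12:13:06", "2017-11-06 12:30:00", "2017-11-07 10:06:15"]),
    "session_id": ["a", "b", "c"],
})
offlinedf = pd.DataFrame({
    "name": ["machine2", "machine13"],
    "start": pd.to_datetime(["2017-11-06 12:00:00", "2017-11-17 14:46:15"]),
    "end": pd.to_datetime(["2017-11-06 12:15:00", "2017-11-17 15:01:15"]),
})

result = sessions_during_offline(df, offlinedf)
```

Unlike the apply version, this returns each matching session_id once rather than one value (or None) per transaction row. The merge creates one row per transaction and offline-window pair for each machine, so memory use could become a problem if a machine has very many offline windows.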
