A. McMaster
A. McMaster

Reputation: 303

Pandas - DateTime within X amount of minutes from row

I am not entirely positive the best way to ask or phrase this question so I will highlight my problem, dataset, my thoughts on the method and end goal and hopefully it will be clear by the end.

My problem:

My company dispatches workers and will load up dispatches to a single employee even if they are on their current dispatch. This is due to limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch.

We are analyzing our dispatching efficiency and I am running into a bit of a head scratcher. I need to run through our 100k row database and add an additional column that will read as a dummy variable 1 for double 0 for normal. BUT as we have multiple people we dispatch and B our records do not start ordered by dispatch, I need to determine how often a dispatch occurs to the same person within 30 minutes.

Dataset:

The dataset is incredibly massive due to poor organization in our data warehouse but for terms of what items I need these are the columns I will need for my calc.

Tech Name         |  Dispatch Time (PST)   
John Smith        |  1/1/2017 12:34
Jane Smith        |  1/1/2017 12:46
John Smith        |  1/1/2017 18:32
John Smith        |  1/1/2017 18:50

My Thoughts: How I would do it is clunky and it could work one way but not backwards. I would more or less write my code as:

import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort('Dispatch Time (PST)', inplace = True)

tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')

for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time.pd.time_delta('0 Days 00:30:00') > row['Tech Dispatch Time (PST)'] AND row['Tech Name'] = tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']

This has many problems from being slow, only tracking dates going backwards and not forwards so I will be missing many dispatches.

End Goal:

My goal is to have a dataset I can then plug back into Tableau for my report by adding on one column that reads as that dummy variable so I can filter and calculate on that.

I appreciate your time and help and let me know if any more details are necessary.

Thank you!

------------------ EDIT ------------- Added a edit to make the question clear as I failed to do so earlier.

Question: Is Pandas the best tool to use to iterate over my dataframe to see each for each datetime dispatch, is there a record that matches the Tech's Name AND is less then 30 minutes away from this record.

If so, how could I improve my algorithm or theory, if not what would the best tool be.

Desired Output - An additional column that records if a dispatch happened within a 30 minute window as a dummy variable 1 for True 0 for False. I need to see when double dispatches are occuring and how many records are true double dispatches, and not just a count that says there were 100 instances of double dispatch, but that involved over 200 records. I need to be able to sort and see each record.

Upvotes: 2

Views: 344

Answers (1)

A. McMaster
A. McMaster

Reputation: 303

Hello I think I found a solution. It slow, only compares one index before or after, but in terms of cases that have 3 dispatches within thirty minutes, this represents less then .5 % for us.

import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'
df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace = True)
df['Double Dispatch'] = np.NaN

writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')

dispatch_count = 0
time = dt.timedelta(minutes = 30)

for index, row in df.iterrows():
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')
    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time 
    dispatch_pd = pd.to_datetime(row[dispatch])
    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.set_value(index, 'Double Dispatch', 1)
            dispatch_count += 1
        else:
            df.set_value(index, 'Double Dispatch', 0)
            dispatch_count += 1
    print(dispatch_count) # This was to monitor total # of records being pushed through

df.to_excel(writer,sheet_name='Sheet1')
writer.save()
writer.close()

Upvotes: 1

Related Questions