S_Ymln
S_Ymln

Reputation: 421

Pandas Optimization - How to deal with Pandas time-series data in two columns and get hourly data in between columns

I am trying to get employees timecard data, each row has punch start_time and end_time and the time in between can be in any range between 0 Min to 9 Hrs. I want to get for each row how much time has the employee worked in each Hour. I am able to do it in the following way:

import pandas as pd 
import numpy as np
pd.options.display.max_columns = 100

store_id = np.repeat(1,10)
employee = [1,2,3,1,2,3,1,2,3,4]
start_time = pd.date_range('2015-07-03', periods=10, freq='25T')
end_time = pd.date_range('2015-07-03', periods = 10,freq = '40T')
df = pd.DataFrame({'store_id':store_id,'employee':employee,'start_time':start_time,'end_time':end_time})
df.head()
    employee    end_time    start_time  store_id
0   1   2015-07-03 00:00:00 2015-07-03 00:00:00 1
1   2   2015-07-03 00:40:00 2015-07-03 00:25:00 1
2   3   2015-07-03 01:20:00 2015-07-03 00:50:00 1
3   1   2015-07-03 02:00:00 2015-07-03 01:15:00 1
4   2   2015-07-03 02:40:00 2015-07-03 01:40:00 1

df['date']=df['start_time'].dt.date

def shift_time_in_hr(row):
    #hrs- Dictionary assigning each hour as a key to the time worked in that particular hour
    hrs = dict(zip(np.arange(0,25),[pd.Timedelta(0)]*24))
    #Case 1: if the start time and end time in the same hour then assign minutes to the start hour
    if row['start_time'].hour == row['end_time'].hour:
        hrs[row['start_time'].hour]= row['end_time']-row['start_time']        
        return row.append(pd.Series(list(hrs.values()),index = ['{}_hr'.format(i) for i in list(hrs.keys())]))
    else:
        hrs_worked = np.arange(row['start_time'].hour,row['end_time'].hour+1)
        #Case 2: If the start_time and end_time are in different Hours and if there are more hours in between assign them with 60 minutes
        if len(hrs_worked)>2:
            for i in range(hrs_worked[0]+1,hrs_worked[-1]):
                hrs[i] = pd.Timedelta('1 Hour')
        #Assign start_time and end_time minutes to respective hours
        hrs[hrs_worked[0]] =  pd.Timedelta('{} Minutes'.format(60-row['start_time'].minute))
        #hrs[hrs_worked[0]] =  60-row['start_time'].minute
        hrs[hrs_worked[-1]]= pd.Timedelta('{} Minutes'.format(row['end_time'].minute))
        return row.append(pd.Series(list(hrs.values()),index = ['{}_hr'.format(i) for i in list(hrs.keys())]))

df=df.apply(shift_time_in_hr,axis = 1)

df.head()
employee    end_time    start_time  store_id    date    0_hr    1_hr    2_hr    3_hr    4_hr    5_hr    6_hr    7_hr    8_hr    9_hr    10_hr   11_hr   12_hr   13_hr   14_hr   15_hr   16_hr   17_hr   18_hr   19_hr   20_hr   21_hr   22_hr   23_hr
0   1   2015-07-03 00:00:00 2015-07-03 00:00:00 1   2015-07-03  00:00:00    00:00:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
1   2   2015-07-03 00:40:00 2015-07-03 00:25:00 1   2015-07-03  00:15:00    00:00:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
2   3   2015-07-03 01:20:00 2015-07-03 00:50:00 1   2015-07-03  00:10:00    00:20:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
3   1   2015-07-03 02:00:00 2015-07-03 01:15:00 1   2015-07-03  00:00:00    00:45:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
4   2   2015-07-03 02:40:00 2015-07-03 01:40:00 1   2015-07-03  00:00:00    00:20:00    00:40:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days

dict_agg= dict(zip(df.columns[5:],[np.sum]*24))
df.groupby(['store_id','employee','date']).agg(dict_agg)

Expected Output: In a day the number of minutes the employee worked in each hour.

                                0_hr    1_hr    2_hr    3_hr    4_hr    5_hr    6_hr    7_hr    8_hr    9_hr    10_hr   11_hr   12_hr   13_hr   14_hr   15_hr   16_hr   17_hr   18_hr   19_hr   20_hr   21_hr   22_hr   23_hr
store_id    employee    date                                                                                                
1           1      2015-07-03   00:00:00    00:45:00    00:30:00    01:00:00    00:00:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
            2      2015-07-03   00:15:00    00:20:00    00:45:00    01:00:00    00:40:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
            3      2015-07-03   00:10:00    00:20:00    00:55:00    01:00:00    01:00:00    00:20:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
            4      2015-07-03   00:00:00    00:00:00    00:00:00    00:15:00    01:00:00    01:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days

Is there any better way of doing it or more Pythonic or Pandas way of achieving what I am able to do in a simple way!

Upvotes: 1

Views: 121

Answers (1)

Evgeny
Evgeny

Reputation: 4551

This is not a complete answer, but rather a building block that you can use.

Let's calculate the minutes worked given the start and end timestamps, hopefully in a more pandas-friendly way:

import pandas as pd

def diff(ts):
    ts[ts < pd.Timedelta(0)] = pd.Timedelta(0) 
    return (ts - ts.shift(1)).fillna(0)

def calculate_time_worked(start, end):
    _range = pd.date_range(start=start.date(), 
                           end=end.date()+pd.Timedelta('1D'), 
                           freq='H')
    base = pd.Series(_range)
    time_worked = diff(base-start) - diff(base-end) 
    time_worked = time_worked.apply(lambda x: x.total_seconds() / 60)
    time_worked.index =  base
    return time_worked[time_worked > 0]


start = pd.Timestamp('2017-06-13 20:11')
end = pd.Timestamp('2017-06-13 22:35')
time_worked = calculate_time_worked(start, end)
assert time_worked.to_dict() == {
    pd.Timestamp('2017-06-13 21:00'): 49.0,
    pd.Timestamp('2017-06-13 22:00'): 60.0,
    pd.Timestamp('2017-06-13 23:00'): 35.0}

There are different ways you can go using the function - for example, yield tuples or dicts of (timestamp, time_worked, id, store) and construct a flat dataframe of work time periods, then reshape it desired format, in a join operation. Feel free to build upon this code and hope it is useful.

Upvotes: 1

Related Questions