Reputation: 421
I am trying to get employees timecard data, each row has punch start_time and end_time and the time in between can be in any range between 0 Min to 9 Hrs. I want to get for each row how much time has the employee worked in each Hour. I am able to do it in the following way:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 100
store_id = np.repeat(1,10)
employee = [1,2,3,1,2,3,1,2,3,4]
start_time = pd.date_range('2015-07-03', periods=10, freq='25T')
end_time = pd.date_range('2015-07-03', periods = 10,freq = '40T')
df = pd.DataFrame({'store_id':store_id,'employee':employee,'start_time':start_time,'end_time':end_time})
df.head()
employee end_time start_time store_id
0 1 2015-07-03 00:00:00 2015-07-03 00:00:00 1
1 2 2015-07-03 00:40:00 2015-07-03 00:25:00 1
2 3 2015-07-03 01:20:00 2015-07-03 00:50:00 1
3 1 2015-07-03 02:00:00 2015-07-03 01:15:00 1
4 2 2015-07-03 02:40:00 2015-07-03 01:40:00 1
df['date']=df['start_time'].dt.date
def shift_time_in_hr(row):
#hrs- Dictionary assigning each hour as a key to the time worked in that particular hour
hrs = dict(zip(np.arange(0,25),[pd.Timedelta(0)]*24))
#Case 1: if the start time and end time in the same hour then assign minutes to the start hour
if row['start_time'].hour == row['end_time'].hour:
hrs[row['start_time'].hour]= row['end_time']-row['start_time']
return row.append(pd.Series(list(hrs.values()),index = ['{}_hr'.format(i) for i in list(hrs.keys())]))
else:
hrs_worked = np.arange(row['start_time'].hour,row['end_time'].hour+1)
#Case 2: If the start_time and end_time are in different Hours and if there are more hours in between assign them with 60 minutes
if len(hrs_worked)>2:
for i in range(hrs_worked[0]+1,hrs_worked[-1]):
hrs[i] = pd.Timedelta('1 Hour')
#Assign start_time and end_time minutes to respective hours
hrs[hrs_worked[0]] = pd.Timedelta('{} Minutes'.format(60-row['start_time'].minute))
#hrs[hrs_worked[0]] = 60-row['start_time'].minute
hrs[hrs_worked[-1]]= pd.Timedelta('{} Minutes'.format(row['end_time'].minute))
return row.append(pd.Series(list(hrs.values()),index = ['{}_hr'.format(i) for i in list(hrs.keys())]))
df=df.apply(shift_time_in_hr,axis = 1)
df.head()
employee end_time start_time store_id date 0_hr 1_hr 2_hr 3_hr 4_hr 5_hr 6_hr 7_hr 8_hr 9_hr 10_hr 11_hr 12_hr 13_hr 14_hr 15_hr 16_hr 17_hr 18_hr 19_hr 20_hr 21_hr 22_hr 23_hr
0 1 2015-07-03 00:00:00 2015-07-03 00:00:00 1 2015-07-03 00:00:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
1 2 2015-07-03 00:40:00 2015-07-03 00:25:00 1 2015-07-03 00:15:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
2 3 2015-07-03 01:20:00 2015-07-03 00:50:00 1 2015-07-03 00:10:00 00:20:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
3 1 2015-07-03 02:00:00 2015-07-03 01:15:00 1 2015-07-03 00:00:00 00:45:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
4 2 2015-07-03 02:40:00 2015-07-03 01:40:00 1 2015-07-03 00:00:00 00:20:00 00:40:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
dict_agg= dict(zip(df.columns[5:],[np.sum]*24))
df.groupby(['store_id','employee','date']).agg(dict_agg)
Expected Output: In a day the number of minutes the employee worked in each hour.
0_hr 1_hr 2_hr 3_hr 4_hr 5_hr 6_hr 7_hr 8_hr 9_hr 10_hr 11_hr 12_hr 13_hr 14_hr 15_hr 16_hr 17_hr 18_hr 19_hr 20_hr 21_hr 22_hr 23_hr
store_id employee date
1 1 2015-07-03 00:00:00 00:45:00 00:30:00 01:00:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
2 2015-07-03 00:15:00 00:20:00 00:45:00 01:00:00 00:40:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
3 2015-07-03 00:10:00 00:20:00 00:55:00 01:00:00 01:00:00 00:20:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
4 2015-07-03 00:00:00 00:00:00 00:00:00 00:15:00 01:00:00 01:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
Is there any better way of doing it or more Pythonic or Pandas way of achieving what I am able to do in a simple way!
Upvotes: 1
Views: 121
Reputation: 4551
This is not a complete answer, but rather a building block that you can use.
Let's calculate the minutes worked given the start and end timestamps, hopefully in a more pandas-friendly way:
import pandas as pd
def diff(ts):
ts[ts < pd.Timedelta(0)] = pd.Timedelta(0)
return (ts - ts.shift(1)).fillna(0)
def calculate_time_worked(start, end):
_range = pd.date_range(start=start.date(),
end=end.date()+pd.Timedelta('1D'),
freq='H')
base = pd.Series(_range)
time_worked = diff(base-start) - diff(base-end)
time_worked = time_worked.apply(lambda x: x.total_seconds() / 60)
time_worked.index = base
return time_worked[time_worked > 0]
start = pd.Timestamp('2017-06-13 20:11')
end = pd.Timestamp('2017-06-13 22:35')
time_worked = calculate_time_worked(start, end)
assert time_worked.to_dict() == {
pd.Timestamp('2017-06-13 21:00'): 49.0,
pd.Timestamp('2017-06-13 22:00'): 60.0,
pd.Timestamp('2017-06-13 23:00'): 35.0}
There are different ways you can go using the function - for example, yield tuples or dicts of (timestamp, time_worked, id, store) and construct a flat dataframe of work time periods, then reshape it desired format, in a join operation. Feel free to build upon this code and hope it is useful.
Upvotes: 1