Reputation: 2678
Here is my dataframe that I am working on. There are two pay periods defined: first 15 days and last 15 days for each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group but at the same time I have to achieve date range for that grouped row.
For example grouped results would be like following for employee_id 1:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?
Upvotes: 1
Views: 781
Reputation: 12396
a. First, (for each employee_id
) use multiple Grouper
with the .sum()
on the hours_worked
column. Second, use DateOffset
to achieve bi-weekly date
column. After these 2 steps, I have assigned the date
in the grouped DF based on 2 brackets (date ranges) - if day of month
(from the date
column) is <=15, then I set the day
in date
to 15, else I set the day
to 30. This day
is then used to assemble a new date
. I calculated month end day based on 1, 2.
b. (For each employee_id
) get the .last()
record for the job_group
and report_id
columns
c. merge a. and b. on the employee_id
key
# a.
hours = (df.groupby([
pd.Grouper(key='employee_id'),
pd.Grouper(key='date', freq='SM')
])['hours_worked']
.sum()
.reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)
# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
month=hours['date'].dt.month,
day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)
# b.
others = (df.groupby('employee_id')['job_group','report_id']
.last()
.reset_index())
# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id==1
and employeeid==3
df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43
Upvotes: 1
Reputation: 862441
Use SM
with Grouper
and last add SemiMonthEnd
:
df['date'] = pd.to_datetime(df['date'])
d = {'hours_worked':'sum','report_id':'first'}
df = (df.groupby(['employee_id','job_group',pd.Grouper(freq='SM',key='date', closed='right')])
.agg(d)
.reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print (df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43
Upvotes: 1