Kishan Mehta
Kishan Mehta

Reputation: 2678

Pandas dataframe Groupby and retrieve date range

Here is my dataframe that I am working on. There are two pay periods defined: first 15 days and last 15 days for each month.

         date  employee_id hours_worked   id job_group  report_id
0  2016-11-14            2         7.50  385         B         43
1  2016-11-15            2         4.00  386         B         43
2  2016-11-30            2         4.00  387         B         43
3  2016-11-01            3        11.50  388         A         43
4  2016-11-15            3         6.00  389         A         43
5  2016-11-16            3         3.00  390         A         43
6  2016-11-30            3         6.00  391         A         43

I need to group by employee_id and job_group but at the same time I have to achieve date range for that grouped row.

For example grouped results would be like following for employee_id 1:

Expected Output:

         date  employee_id hours_worked  job_group  report_id
1  2016-11-15            2         11.50        B         43
2  2016-11-30            2         4.00         B         43
4  2016-11-15            3         17.50        A         43
5  2016-11-16            3         9.00         A         43

Is this possible using pandas dataframe groupby?

Upvotes: 1

Views: 781

Answers (2)

edesz
edesz

Reputation: 12396

a. First, (for each employee_id) use multiple Grouper with the .sum() on the hours_worked column. Second, use DateOffset to achieve bi-weekly date column. After these 2 steps, I have assigned the date in the grouped DF based on 2 brackets (date ranges) - if day of month (from the date column) is <=15, then I set the day in date to 15, else I set the day to 30. This day is then used to assemble a new date. I calculated month end day based on 1, 2.

b. (For each employee_id) get the .last() record for the job_group and report_id columns

c. merge a. and b. on the employee_id key

# a.
hours = (df.groupby([
            pd.Grouper(key='employee_id'),
            pd.Grouper(key='date', freq='SM')
                    ])['hours_worked']
            .sum()
            .reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)

# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
                                    month=hours['date'].dt.month,
                                    day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)

# b.
others = (df.groupby('employee_id')['job_group','report_id']
            .last()
            .reset_index())

# c.
merged = hours.merge(others, how='inner', on='employee_id')

Raw data for employee_id==1 and employeeid==3

df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])

    index       date  employee_id  hours_worked   id job_group  report_id
0       0 2016-11-14            1           7.5  481         A         43
10     10 2016-11-21            1           6.0  491         A         43
11     11 2016-11-22            1           5.0  492         A         43
15     15 2016-12-14            1           7.5  496         A         43
25     25 2016-12-21            1           6.0  506         A         43
26     26 2016-12-22            1           5.0  507         A         43
6       6 2016-11-02            3           6.0  487         A         43
4       4 2016-11-08            3           6.0  485         A         43
3       3 2016-11-09            3          11.5  484         A         43
5       5 2016-11-11            3           3.0  486         A         43
20     20 2016-11-12            3           3.0  501         A         43
21     21 2016-12-02            3           6.0  502         A         43
19     19 2016-12-08            3           6.0  500         A         43
18     18 2016-12-09            3          11.5  499         A         43

Output

print(merged)

    employee_id       date  hours_worked job_group  report_id
0             1 2016-11-15           7.5         A         43
1             1 2016-11-30          11.0         A         43
2             1 2016-12-15           7.5         A         43
3             1 2016-12-31          11.0         A         43
4             2 2016-11-15          31.0         B         43
5             2 2016-12-15          31.0         B         43
6             3 2016-11-15          29.5         A         43
7             3 2016-12-15          23.5         A         43
8             4 2015-03-15           5.0         B         43
9             4 2016-02-29           5.0         B         43
10            4 2016-11-15           5.0         B         43
11            4 2016-11-30          15.0         B         43
12            4 2016-12-15           5.0         B         43
13            4 2016-12-31          15.0         B         43

Upvotes: 1

jezrael
jezrael

Reputation: 862441

Use SM with Grouper and last add SemiMonthEnd:

df['date'] = pd.to_datetime(df['date'])

d = {'hours_worked':'sum','report_id':'first'}
df = (df.groupby(['employee_id','job_group',pd.Grouper(freq='SM',key='date', closed='right')])
       .agg(d)
       .reset_index())

df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print (df)
   employee_id job_group       date  hours_worked  report_id
0            2         B 2016-11-15          11.5         43
1            2         B 2016-11-30           4.0         43
2            3         A 2016-11-15          17.5         43
3            3         A 2016-11-30           9.0         43

Upvotes: 1

Related Questions