Reputation: 117
Name date leave marked_leave_bfr_days
A 8/1/2021 1 3
A 8/2/2021 1 4
A 8/3/2021 1 5
A 8/4/2021 1 5
A 8/5/2021 1 6
A 8/6/2021 1 7
A 8/7/2021 1 8
A 8/8/2021 0 -1
A 8/9/2021 0 -1
A 8/10/2021 1 12
A 8/11/2021 1 13
A 8/12/2021 0 -1
B 8/4/2021 1 1
B 8/5/2021 1 1
B 8/6/2021 1 3
B 8/7/2021 0 -1
B 8/8/2021 0 -1
B 8/9/2021 0 -1
B 8/10/2021 0 -1
B 8/11/2021 0 -1
My df is like above. Name is employee code, leave is boolean (=1 means leave, =0 means no leave), marked_leave_bfr_days is how many days prior to date was leave applied.
I wish to transform it to this df:
Name date leave marked_leave_bfr_days leave_applied leave_start leave_end no_of_leaves
A 8/1/2021 1 3 7/29/2021 8/1/2021 8/3/2021 3
A 8/2/2021 1 4 7/29/2021 8/1/2021 8/3/2021 3
A 8/3/2021 1 5 7/29/2021 8/1/2021 8/3/2021 3
A 8/4/2021 1 5 7/30/2021 8/4/2021 8/7/2021 4
A 8/5/2021 1 6 7/30/2021 8/4/2021 8/7/2021 4
A 8/6/2021 1 7 7/30/2021 8/4/2021 8/7/2021 4
A 8/7/2021 1 8 7/30/2021 8/4/2021 8/7/2021 4
A 8/8/2021 0 -1 -1 -1 -1 -1
A 8/9/2021 0 -1 -1 -1 -1 -1
A 8/10/2021 1 12 7/29/2021 8/10/2021 8/11/2021 2
A 8/11/2021 1 13 7/29/2021 8/10/2021 8/11/2021 2
A 8/12/2021 0 -1 -1 -1 -1 -1
B 8/4/2021 1 1 8/3/2021 8/4/2021 8/4/2021 1
B 8/5/2021 1 1 8/4/2021 8/5/2021 8/5/2021 1
B 8/6/2021 1 3 8/3/2021 8/6/2021 8/6/2021 1
B 8/7/2021 0 -1 -1 -1 -1 -1
B 8/8/2021 0 -1 -1 -1 -1 -1
B 8/9/2021 0 -1 -1 -1 -1 -1
B 8/10/2021 0 -1 -1 -1 -1 -1
B 8/11/2021 0 -1 -1 -1 -1 -1
I am doing this:
df.loc[df.leave==1, 'leave_applied'] = (df['date'] - df['marked_leave_bfr_days'].map(timedelta))
df = df[df.leave==1].groupby(['Name', 'leave_applied').agg({'date':['min', 'max']}).reset_index()
But, this is not working as there are two separate chunk of leaves applied on same date by User. Is there a way to work around this?
Upvotes: 0
Views: 225
Reputation: 13656
Just add a helper column to indicate same date bulk:
df['leave_applied'] = (df['date'] - df['marked_leave_bfr_days'].map(lambda x: pd.Timedelta(days=x))).dt.date
# create helper
df['hlp_idx'] = (df.leave_applied != df.leave_applied.shift()).cumsum()
# use helper column to create proper date groups
date_grp = df.groupby(['Name', 'leave_applied','hlp_idx'])['date']
# create desired values
df['leave_start'] = date_grp.transform('min').dt.date
df['leave_end'] = date_grp.transform('max').dt.date
df['no_of_leaves'] = date_grp.transform('count')
# clean up
df.loc[df['marked_leave_bfr_days']==-1,['leave_applied', 'leave_start', 'leave_end','no_of_leaves']] = -1
df = df.drop(columns='hlp_idx')
This will provide the desired result.
Upvotes: 3
Reputation: 139
Your question is quite vague. It is unclear whether you have a problem with the code or are trying to find suitable logic to implement your strategy. It would be helpful if you explained in detail what the date columns mean and how you want them transformed, since your current explanation is not quite clear. For instance, it is difficult to understand why for the same name say A on 8/3/2021 the marked_leave_bfr_days is 5, and a day later on 8/4/2021, the marked_leave_bfr_days is still 5 while logically it should be 6. Anyway, I ran the code below, and there seem to be no problems:
import pandas as pd
import numpy as np
from pandas import Timedelta
from itertools import compress
date = [pd.Timestamp('8/{}/2021'.format(i)) for i in range(1,13) ]
date.extend([pd.Timestamp('8/{}/2021'.format(i)) for i in range(4,12) ])
name = ['A']*12+['B']*8
leave = [1]*7+[0]*2 + [1]*2 + [0]+ [1]*3 + [0]*5
marked_leave_bfr_days = [3,4,5,5,6,7,8,-1,-1,12,13,-1,1,1,3,-1,-1,-1,-1,-1]
data = {'Name': name, 'date':date, 'leave': leave,'marked_leave_bfr_days': marked_leave_bfr_days}
df = pd.DataFrame(data=data)
df.loc[df.leave==1, 'leave_applied'] = (df['date'] - df['marked_leave_bfr_days'].map(Timedelta))
df.loc[df.leave==0 ,'leave_applied'] = -1
df = df[df.leave==1].groupby(['Name', 'leave_applied']).agg({'date':['min', 'max']}).reset_index()
Upvotes: 1
Reputation: 563
Perhaps it's more easier to iterate rows, because the business logic is too complicated.
In this answer How to iterate over rows in a DataFrame in Pandas , it doesn't recommend to loop rows.
I wrote a demo as below, hope it helps:
import pandas as pd
from datetime import datetime, timedelta
file = 'D:/test/TestPython/leave.csv'
df = pd.read_csv(file)
df['leave_applied'] = ''
df['leave_start'] = ''
df['leave_end'] = ''
df['no_of_leaves'] = ''
indexes = []
previous_leave_applied = ''
leave_start = ''
leave_end = ''
# when a leave duration ends, fill values in the leave duration.
def fill_leave_duration(indexes, leave_start, leave_end):
for index in indexes:
df.loc[index, 'leave_start'] = leave_start.strftime('%#m/%#d/%Y')
df.loc[index, 'leave_end'] = leave_end.strftime('%#m/%#d/%Y')
df.loc[index, 'no_of_leaves'] = (leave_end - leave_start).days + 1
for index, row in df.iterrows():
if (row.leave == 0):
df.loc[index, 'leave_applied'] = -1
df.loc[index, 'leave_start'] = -1
df.loc[index, 'leave_end'] = -1
df.loc[index, 'no_of_leaves'] = -1
else:
date = datetime.strptime(row.date, "%m/%d/%Y")
leave_applied = date - timedelta(row.marked_leave_bfr_days)
df.loc[index, 'leave_applied'] = leave_applied.strftime('%#m/%#d/%Y')
# check whether it is a new start of leave duration.
if (leave_applied != previous_leave_applied):
# clear previous row indexes.
indexes = []
leave_start = date
indexes.append(index)
leave_end = date
fill_leave_duration(indexes, leave_start, leave_end)
previous_leave_applied = leave_applied
df.to_csv('D:/test/TestPython/leave_result.csv', index=False)
Upvotes: 0