Lata
Lata

Reputation: 117

Pandas: Not getting the desired transformation

Name        date    leave   marked_leave_bfr_days
   A    8/1/2021        1                       3
   A    8/2/2021        1                       4
   A    8/3/2021        1                       5
   A    8/4/2021        1                       5
   A    8/5/2021        1                       6
   A    8/6/2021        1                       7
   A    8/7/2021        1                       8
   A    8/8/2021        0                      -1
   A    8/9/2021        0                      -1
   A   8/10/2021        1                      12
   A   8/11/2021        1                      13
   A   8/12/2021        0                      -1
   B    8/4/2021        1                       1
   B    8/5/2021        1                       1
   B    8/6/2021        1                       3
   B    8/7/2021        0                      -1
   B    8/8/2021        0                      -1
   B    8/9/2021        0                      -1
   B   8/10/2021        0                      -1
   B   8/11/2021        0                      -1

My df is like above. Name is employee code, leave is boolean (=1 means leave, =0 means no leave), marked_leave_bfr_days is how many days prior to date was leave applied.

I wish to transform it to this df:

Name        date    leave   marked_leave_bfr_days   leave_applied   leave_start   leave_end no_of_leaves
   A    8/1/2021        1                       3       7/29/2021      8/1/2021    8/3/2021            3
   A    8/2/2021        1                       4       7/29/2021      8/1/2021    8/3/2021            3
   A    8/3/2021        1                       5       7/29/2021      8/1/2021    8/3/2021            3
   A    8/4/2021        1                       5       7/30/2021      8/4/2021    8/7/2021            4
   A    8/5/2021        1                       6       7/30/2021      8/4/2021    8/7/2021            4
   A    8/6/2021        1                       7       7/30/2021      8/4/2021    8/7/2021            4
   A    8/7/2021        1                       8       7/30/2021      8/4/2021    8/7/2021            4
   A    8/8/2021        0                      -1              -1            -1          -1           -1
   A    8/9/2021        0                      -1              -1            -1          -1           -1
   A   8/10/2021        1                      12       7/29/2021     8/10/2021   8/11/2021            2
   A   8/11/2021        1                      13       7/29/2021     8/10/2021   8/11/2021            2
   A   8/12/2021        0                      -1              -1            -1          -1           -1
   B    8/4/2021        1                       1        8/3/2021      8/4/2021    8/4/2021            1
   B    8/5/2021        1                       1        8/4/2021      8/5/2021    8/5/2021            1
   B    8/6/2021        1                       3        8/3/2021      8/6/2021    8/6/2021            1
   B    8/7/2021        0                      -1              -1            -1          -1           -1
   B    8/8/2021        0                      -1              -1            -1          -1           -1
   B    8/9/2021        0                      -1              -1            -1          -1           -1
   B   8/10/2021        0                      -1              -1            -1          -1           -1
   B   8/11/2021        0                      -1              -1            -1          -1           -1

I am doing this:

df.loc[df.leave==1, 'leave_applied'] = (df['date'] - df['marked_leave_bfr_days'].map(timedelta))
df = df[df.leave==1].groupby(['Name', 'leave_applied').agg({'date':['min', 'max']}).reset_index()

But, this is not working as there are two separate chunk of leaves applied on same date by User. Is there a way to work around this?

Upvotes: 0

Views: 225

Answers (3)

igrinis
igrinis

Reputation: 13656

Just add a helper column to indicate same date bulk:

df['leave_applied'] = (df['date'] - df['marked_leave_bfr_days'].map(lambda x: pd.Timedelta(days=x))).dt.date

# create helper
df['hlp_idx'] = (df.leave_applied != df.leave_applied.shift()).cumsum()

# use helper column to create proper date groups 
date_grp = df.groupby(['Name', 'leave_applied','hlp_idx'])['date']

# create desired values
df['leave_start'] = date_grp.transform('min').dt.date
df['leave_end'] = date_grp.transform('max').dt.date
df['no_of_leaves'] = date_grp.transform('count')

# clean up
df.loc[df['marked_leave_bfr_days']==-1,['leave_applied', 'leave_start', 'leave_end','no_of_leaves']] = -1
df = df.drop(columns='hlp_idx')

This will provide the desired result.

Upvotes: 3

sadegh arefizadeh
sadegh arefizadeh

Reputation: 139

Your question is quite vague. It is unclear whether you have a problem with the code or are trying to find suitable logic to implement your strategy. It would be helpful if you explained in detail what the date columns mean and how you want them transformed, since your current explanation is not quite clear. For instance, it is difficult to understand why for the same name say A on 8/3/2021 the marked_leave_bfr_days is 5, and a day later on 8/4/2021, the marked_leave_bfr_days is still 5 while logically it should be 6. Anyway, I ran the code below, and there seem to be no problems:

import pandas as pd
import numpy as np
from pandas import Timedelta
from itertools import compress
date = [pd.Timestamp('8/{}/2021'.format(i)) for i in range(1,13) ]
date.extend([pd.Timestamp('8/{}/2021'.format(i)) for i in range(4,12) ])
name = ['A']*12+['B']*8
leave = [1]*7+[0]*2 + [1]*2 + [0]+ [1]*3 + [0]*5
marked_leave_bfr_days = [3,4,5,5,6,7,8,-1,-1,12,13,-1,1,1,3,-1,-1,-1,-1,-1]
data = {'Name': name, 'date':date, 'leave': leave,'marked_leave_bfr_days': marked_leave_bfr_days}
df = pd.DataFrame(data=data)

df.loc[df.leave==1, 'leave_applied'] = (df['date'] - df['marked_leave_bfr_days'].map(Timedelta))
df.loc[df.leave==0 ,'leave_applied'] = -1
df = df[df.leave==1].groupby(['Name', 'leave_applied']).agg({'date':['min', 'max']}).reset_index()

Upvotes: 1

cg-zhou
cg-zhou

Reputation: 563

Perhaps it's more easier to iterate rows, because the business logic is too complicated.

In this answer How to iterate over rows in a DataFrame in Pandas , it doesn't recommend to loop rows.

I wrote a demo as below, hope it helps:

import pandas as pd
from datetime import datetime, timedelta

file = 'D:/test/TestPython/leave.csv' 
df = pd.read_csv(file)

df['leave_applied'] = ''
df['leave_start'] = ''
df['leave_end'] = ''
df['no_of_leaves'] = ''

indexes = []
previous_leave_applied = ''
leave_start = ''
leave_end = ''

# when a leave duration ends, fill values in the leave duration.
def fill_leave_duration(indexes, leave_start, leave_end):
    for index in indexes:
        df.loc[index, 'leave_start'] = leave_start.strftime('%#m/%#d/%Y')
        df.loc[index, 'leave_end'] = leave_end.strftime('%#m/%#d/%Y')
        df.loc[index, 'no_of_leaves'] = (leave_end - leave_start).days + 1

for index, row in df.iterrows():
    if (row.leave == 0):
        df.loc[index, 'leave_applied'] = -1
        df.loc[index, 'leave_start'] = -1
        df.loc[index, 'leave_end'] = -1
        df.loc[index, 'no_of_leaves'] = -1
    else:
        date = datetime.strptime(row.date, "%m/%d/%Y")
        leave_applied = date - timedelta(row.marked_leave_bfr_days)
        df.loc[index, 'leave_applied'] = leave_applied.strftime('%#m/%#d/%Y')

        # check whether it is a new start of leave duration.
        if (leave_applied != previous_leave_applied):
            # clear previous row indexes.
            indexes = []
            leave_start = date
        indexes.append(index)

        leave_end = date
        fill_leave_duration(indexes, leave_start, leave_end)
        previous_leave_applied = leave_applied

df.to_csv('D:/test/TestPython/leave_result.csv', index=False)

And the execution result: enter image description here

Upvotes: 0

Related Questions