Logan
Logan

Reputation: 293

Python - Group by time periods with tolerance

Three columns exist in this data set: ID (unique employee identification), WorkComplete (indicating when all work has been completed), and DateDiff (number of days from their start date). I am looking to group the DaysDiff column based on certain time periods with an added layer of tolerance or leniency. For my mock data, I am spacing the time periods by 30 days.

Group 0: 0-30 DateDiff (with a 30 day extra window if 'Y' is not found)
Group 1: 31-60 DateDiff (with a 30 day extra window if 'Y' is not found)
Group 2: 61-90 DateDiff (with a 30 day extra window if 'Y' is not found)

I was able to create very basic code and assign the groupings, but I am having trouble with the extra 30 day window. For example, if an employee completed their work (Y) during the time periods above, then they receive the attributed grouping. For ID 111 below, you can see that the person did not complete their work within the first 30 days, so I am giving them an addition 30 days to complete their work. If they complete their work, then the first instance we see a 'Y', it is grouped in the previous grouping.

df = pd.DataFrame({'ID':[111, 111, 111, 111, 111, 111, 112, 112, 112],
                   'WorkComplete':['N', 'N', 'Y', 'N', 'N', 'N', 'N', 'Y', 'Y'],
                   'DaysDiff': [0, 29, 45, 46, 47, 88, 1, 12, 89]})

Input

ID   WorkComplete      DaysDiff 
111  N                 0
111  N                 29
111  Y                 45
111  N                 46
111  N                 47
111  N                 88
123  N                 1 
123  Y                 12
123  Y                 89        

Output

ID   WorkComplete      DaysDiff   Group
111  N                 0          0
111  N                 29         0
111  Y                 45         0   <---- note here the grouping is 0 to provide extra time
111  N                 46         1   <---- back to normal
111  N                 47         1   
111  N                 88         2
123  N                 1          0
123  Y                 12         0
123  Y                 89         2
minQ1 = 0
highQ1 = 30
minQ2 = 31
highQ2 = 60
minQ2 = 61
highQ2 = 90

def Group_df(df):
    if (minQ1 <= df['DateDiff'] <= highQ1): return '0'
    elif (minQ1 <= df['DateDiff'] <= highQ1): return '1'
    elif (minQ2 <= df['DateDiff'] <= highQ2): return '2'

df['Group'] = df.apply(Group_df, axis = 1)

The trouble I am having is allowing for the additional 30 days if the person did not complete the work. My above attempt is partial at trying to resolve the issue.

Upvotes: 1

Views: 624

Answers (1)

David Erickson
David Erickson

Reputation: 16683

  1. You can use np.select for the primary conditions.
  2. Then, use mask for the specific condition you mention. s is the first index location for all Y values per group. I then temporarily assign s as a new column, so that I can check for rows against df.index (the index) to return rows that meet the condition. The second condition is if the group number is 1 from the previos line of code:

df['Group'] = np.select([df['DaysDiff'].between(0,30), 
                         df['DaysDiff'].between(31,60), 
                         df['DaysDiff'].between(61,90)],
                         [0,1,2])
s = df[df['WorkComplete'] == 'Y'].groupby('ID')['DaysDiff'].transform('idxmin')
df['Group'] = df['Group'].mask((df.assign(s=s)['s'].eq(df.index)) & (df['Group'].eq(1)), 0)
df
Out[1]: 
    ID WorkComplete  DaysDiff  Group
0  111            N         0      0
1  111            N        29      0
2  111            Y        45      0
3  111            N        46      1
4  111            N        47      1
5  111            N        88      2
6  123            N         1      0
7  123            Y        12      0
8  123            Y        89      2

Upvotes: 1

Related Questions