Reputation: 21
Assuming I have the following dataset saved in a Pandas dataframe - note the last column [Status] is the column I'd like to create:
Department  Employee  Issue Date  Submission Date  ***Status***
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1
I'd like to create an additional column [Status] that counts, for each row, how many previous rows within the same Department and Employee group have a [Submission Date] later than the current row's [Issue Date].
As an example, take employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] falls after that [Issue Date], so [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date]s of rows 1 and 2 both fall after that [Issue Date], so [Status] = 2 for row 3. This calculation needs to be performed for every unique combination of Department and Employee.
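The logic described above can be sketched directly with a per-group loop. This is a minimal illustration using only Joe's three rows from the table (the column names follow the question's table):

```python
import pandas as pd

# Joe's three rows from the example table
df = pd.DataFrame({
    'Department': ['A', 'A', 'A'],
    'Employee': ['Joe', 'Joe', 'Joe'],
    'Issue Date': pd.to_datetime(
        ['18/05/2014', '1/06/2014', '23/06/2014'], dayfirst=True),
    'Submission Date': pd.to_datetime(
        ['25/06/2014', '28/06/2014', '30/06/2014'], dayfirst=True),
})

parts = []
for _, g in df.groupby(['Department', 'Employee']):
    # For row i, count earlier rows (0..i-1) whose Submission Date
    # falls after row i's Issue Date
    status = [(g['Submission Date'].iloc[:i] > issue).sum()
              for i, issue in enumerate(g['Issue Date'])]
    parts.append(pd.Series(status, index=g.index))

df['Status'] = pd.concat(parts)
print(df['Status'].tolist())  # [0, 1, 2], matching the example
```

This assumes each group's rows are already ordered by Issue Date, as they are in the question's table.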
Upvotes: 2
Views: 1421
Reputation: 3275
This question was posted 6 months ago but hopefully my answer still provides some help.
First, import the libraries and make the dataframe:
# import libraries
import numpy as np
import pandas as pd
# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 + ['Mark']*2 + ['William']*3 +
                                ['Liz']*3 + ['John']*2 + ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})
Second, convert Issue Date and Submission Date to datetime:
# Convert 'Issue Date', 'Submission Date' to datetime
df['Issue Date'] = pd.to_datetime(df['Issue Date'], dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'], dayfirst = True)
Third, sort the values by Department, Employee, and Issue Date, then reset the index. Note that the result must be assigned back to df; calling sort_values with inplace = True on the temporary returned by reset_index would leave df unsorted:
# Sort by 'Department', 'Employee', 'Issue Date'; reset index
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)
Fourth, group by Department, Employee; cumulative count the rows; insert into the original df:
# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
'grouped count',
df.groupby(['Department',
'Employee']).cumcount())
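To see what cumcount produces, here is a tiny hypothetical frame with two employees; each group is numbered 0, 1, 2, ... independently, in row order:

```python
import pandas as pd

# Per-group running row counter, restarting at 0 for each group
d = pd.DataFrame({'Employee': ['Joe', 'Joe', 'Joe', 'Mark', 'Mark']})
print(d.groupby('Employee').cumcount().tolist())  # [0, 1, 2, 0, 1]
```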
Fifth, create a no_issue and no_submission dataframe and merge them together on Department and Employee:
# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)
# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)
# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
how = 'outer',
on = ['Department',
'Employee'])
This duplicates each Submission Date by the number of Issue Dates in its Department/Employee group (a cross join within each group).
For Joe, for example, each of his three Issue Dates is paired with each of his three Submission Dates, giving nine rows.
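The duplication comes from merging on non-unique keys: every left row is paired with every right row that shares the key. A stripped-down sketch (with placeholder values rather than real dates):

```python
import pandas as pd

# Merging on a duplicated key pairs every left row with every right row
left = pd.DataFrame({'Employee': ['Joe'] * 3,
                     'Submission Date': ['s1', 's2', 's3']})
right = pd.DataFrame({'Employee': ['Joe'] * 3,
                      'Issue Date': ['i1', 'i2', 'i3']})
merged = left.merge(right, how = 'outer', on = 'Employee')
print(len(merged))  # 9 rows: 3 submissions x 3 issues
```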
Sixth, create a dataframe that only keeps rows where grouped count_x is less than grouped count_y, then sort by Department, Employee, and Issue Date:
# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
'Employee',
'Issue Date'])
Seventh, insert the status column as a boolean where Issue Date is less than Submission Date:
# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
'Status',
merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])
Eighth, group by Department, Employee, and Issue Date, sum the Status, and reset the index:
# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
merged1 = merged1.groupby(['Department',
'Employee',
'Issue Date']).agg({'Status' : 'sum'}).reset_index()
This returns a dataframe with the correct Status for every row except the minimum Issue Date of each Department/Employee group, which was filtered out in the sixth step.
Ninth, group the original merged dataframe by Department and Employee, find the minimum Issue Date, and reset the index:
# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
'Employee']).agg({'Issue Date' : 'min'}).reset_index()
Tenth, concatenate merged1 with merged, fill the resulting NaNs with 0 (since the minimum Issue Date always has a Status of 0), and sort by Department, Employee, and Issue Date:
# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
'Employee',
'Issue Date'])
Eleventh, inner merge the original df with the concatenated dataframe on Department, Employee, and Issue Date, then drop the grouped count column:
# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
how = 'inner',
on = ['Department',
'Employee',
'Issue Date']).drop('grouped count',
axis = 1)
Voila! Here is your final dataframe:
# Final df
final
Upvotes: 1