PRGB

Reputation: 21

Pandas dataframe: Create additional column based on date columns comparison

Assuming I have the following dataset saved in a Pandas dataframe - note the last column [Status] is the column I'd like to create:

Department  Employee  Issue Date  Submission Date  ***Status***
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1

I'd like to create an additional column [Status] based on the following conditions:

  1. For every unique [Department] & [Employee] combination (e.g. there are three rows corresponding to Joe in Department A), sort the [Issue Date] from oldest to newest
  2. If the current row's [Issue Date] is later than ALL previous rows' [Submission Date]s, flag [Status] with 0; otherwise set [Status] to the number of previous rows whose [Submission Date] falls after the current row's [Issue Date]

As an example, take employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] ('25/06/2014') falls after it, so [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date]s of rows 1 and 2 both fall after it, so [Status] = 2 for row 3. This calculation needs to be performed for every unique combination of Department and Employee, as sketched below.
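
In pseudocode terms, what I'm after is something like the sketch below, assuming the data sits in a DataFrame df with both date columns already parsed as datetimes (status_for_group is just an illustrative name):

import pandas as pd

def status_for_group(group):
    # Within one Department/Employee group, sorted by 'Issue Date':
    # count how many earlier rows have a 'Submission Date' later than
    # the current row's 'Issue Date'.
    group = group.sort_values('Issue Date')
    status = []
    for i, issue in enumerate(group['Issue Date']):
        status.append(int((group['Submission Date'].iloc[:i] > issue).sum()))
    return pd.Series(status, index=group.index)

df['Status'] = (df.groupby(['Department', 'Employee'], group_keys=False)
                  .apply(status_for_group))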

Upvotes: 2

Views: 1421

Answers (1)

Ian Thompson

Reputation: 3275

This question was posted 6 months ago but hopefully my answer still provides some help.

First, import the libraries and make the dataframe:

# import libraries
import numpy as np
import pandas as pd

# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 +\
                                ['Mark']*2 +\
                                ['William']*3 +\
                                ['Liz']*3 +\
                                ['John']*2 +\
                                ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})

df

Second, convert Issue Date and Submission Date to datetime:

# Convert 'Issue Date', 'Submission Date' to pd.datetime
df.loc[:, 'Issue Date'] = pd.to_datetime(df.loc[:, 'Issue Date'],
                                         dayfirst = True)
df.loc[:, 'Submission Date'] = pd.to_datetime(df.loc[:, 'Submission Date'],
                                              dayfirst = True)

Third, sort the values by Department, Employee, and Issue Date and reset the index. Note that the sorted result has to be assigned back to df (chaining sort_values with inplace = True onto reset_index would only sort a temporary copy):

# Sort by 'Department', 'Employee', 'Issue Date'; reset index
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)

Fourth, group by Department and Employee, take a cumulative count of the rows, and insert it into the original df:

# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())

(screenshot: df with the new grouped count column)
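
cumcount simply numbers the rows within each Department/Employee group starting at 0, so, for example, Joe's three rows get grouped count 0, 1 and 2:

# Joe's three rows get grouped count 0, 1, 2
df[(df['Department'] == 'A') & (df['Employee'] == 'Joe')]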

Fifth, create no_issue and no_submission dataframes and merge them on Department and Employee:

# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)

# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)

# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how = 'outer',
                        on = ['Department',
                              'Employee'])

This pairs every Submission Date with every Issue Date within each Department/Employee group, so each Submission Date is repeated once per Issue Date in the group.

Here's what it looks like for Joe:

(screenshot: the merged rows for Joe)
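
Since Joe has three rows in df, the merge produces 3 * 3 = 9 rows for him; a quick check:

# Each of Joe's 3 Issue Dates is paired with each of his 3 Submission Dates
merged[merged['Employee'] == 'Joe'].shape[0]   # 9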

Sixth, create a dataframe that only keeps rows where grouped count_x is less than grouped count_y, then sort by Department, Employee, and Issue Date:

# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
                                                                     'Employee',
                                                                     'Issue Date'])

Seventh, insert the status column as a boolean where Issue Date is less than Submission Date:

# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])

Eighth, group by Department, Employee, and Issue Date, sum the Status, and reset the index:

# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : np.sum}).reset_index()

This returns a dataframe with the correct Status for every row except each Department/Employee group's earliest Issue Date, which has no preceding rows and therefore never appears in merged1.

(screenshot: the grouped status dataframe)
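
For example, Joe's earliest Issue Date (18/05/2014) has grouped count 0, so there are no 'grouped count_x' < 'grouped count_y' pairs for it and it is missing here:

# Only Joe's 2nd and 3rd Issue Dates appear, with Status 1 and 2
merged1[merged1['Employee'] == 'Joe']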

Ninth, group the original merged dataframe by Department and Employee, find the minimum Issue Date, and reset the index:

# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date' : 'min'}).reset_index()

Tenth, concatenate merged1 with merged, fill the missing Statuses with 0 (since the minimum Issue Date will always have a Status of 0), and sort by Department, Employee, and Issue Date:

# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
                                                                        'Employee',
                                                                        'Issue Date'])

Eleventh, inner merge the original df with the concatenated dataframe on Department, Employee, and Issue Date, then drop the grouped count column:

# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how = 'inner',
                 on = ['Department',
                       'Employee',
                       'Issue Date']).drop('grouped count',
                                           axis = 1)

Voila! Here is your final dataframe:

# Final df
final

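As a quick check against the example in the question, Joe's rows come back with Status 0, 1 and 2:

# Joe's statuses match the worked example in the question (0, 1, 2)
final.loc[final['Employee'] == 'Joe', ['Issue Date', 'Submission Date', 'Status']]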

Upvotes: 1
