Retrieving second last unique value with repeated values in column Python

Question

I have a dataset with email addresses, dates and job function. Am keen to get the current job function and the previous job function (the value that is different from the current job function and also the job function that was held before the current job function). For example, john.k@abc.com is currently swe_mgr2 and had a previous job function of swe_mgr1. Am also keen to capture the duration of the previous job function. The duration is tricky to calculate because of the start dates that are captured in random fashion, but it can be captured based on the first row of the start date of the previous job function till the first row of the start date of the current job function. For example, john.k@abc.com had a previous job function of swe_mgr1 which lasted from 30-08-2018 till 01-06-2019 (ie 10 mths).

dataset

email              startdate       jobfunction
john.k@abc.com     01-01-2018      swe_ic1
john.k@abc.com     01-03-2018      swe_ic2
john.k@abc.com     30-08-2018      swe_mgr1
john.k@abc.com     01-06-2019      swe_mgr2  
john.k@abc.com     01-06-2020      swe_mgr2
greg.h@abc.com     30-01-2018      mkt_ic2
greg.h@abc.com     01-06-2018      mkt_ic3 
greg.h@abc.com     07-09-2018      mkt_mgr1
greg.h@abc.com     12-12-2018      mkt_mgr2
greg.h@abc.com     15-01-2019      mkt_mgr2 
greg.h@abc.com     05-06-2019      mkt_mgr2
greg.h@abc.com     01-06-2020      mkt_mgr3
joseph.c@abc.com   01-06-2019      sales_ic1
joseph.c@abc.com   01-06-2020      sales_mgr1

The expected output is

email             current_function     previous_function      duration_previous_function
john.k@abc.com    swe_mgr2             swe_mgr1                10mths
greg.h@abc.com    mkt_mgr3             mkt_mgr2                18mths
joseph.c@abc.com  sales_mgr1           sales_ic1               12mths

I am stuck at the first step of trying to get the previous job function,

This code seems to work at retrieving the current job function, but not the previous job function

df2 = df.groupby('email').last().sort_index().reset_index().drop_duplicates()

Am also wondering if this could be achieved using by looping through each email address but the code below does not work as well

emails = df['email']
assigndate = df['startdate']
jobname = df['jobfunction']

for i in emails:
    prevjob = jobname.apply(lambda x: x.unique([-2]))

Appreciate any form of help and assistance, thank you.

jezrael · Accepted Answer

You can first after converting to datetimes sort columns by DataFrame.sort_values and get last duplicates by DataFrame.drop_duplicates, then create counter by GroupBy.cumcount for filter first 2 rows and later for new level in MultiIndex created by DataFrame.set_index and DataFrame.unstack:

df['startdate'] = pd.to_datetime(df['startdate'], dayfirst=True)

df = (df.sort_values(['email','startdate'], ascending=[True, False])
        .drop_duplicates(['email','jobfunction'], keep='last'))
df['g'] = df.groupby('email').cumcount()
df1 = df[df['g'].lt(2)].copy()
df1 = (df1.set_index(['email','g'])
          .unstack()
          .rename(columns={0:'current',1:'previous'}))
df1.columns = [f'{b}_{a}' for a,b in df1.columns]
df1 = df1.reset_index()

Last subtract columns converted to months periods by DataFrame.pop and Series.dt.to_period:

df1['duration_previous_function'] = (df1.pop('current_startdate')
                                        .dt.to_period('m')
                                        .astype('int')
                                        .sub(df1.pop('previous_startdate')
                                                .dt.to_period('m')
                                                .astype('int')))
print (df1)
              email current_jobfunction previous_jobfunction  \
0    greg.h@abc.com            mkt_mgr3             mkt_mgr2   
1    john.k@abc.com            swe_mgr2             swe_mgr1   
2  joseph.c@abc.com          sales_mgr1            sales_ic1   

   duration_previous_function  
0                          18  
1                          10  
2                          12

Retrieving second last unique value with repeated values in column Python

Answers (1)

Related Questions