wjie08
wjie08

Reputation: 445

Retrieving second last unique value with repeated values in column Python

I have a dataset with email addresses, dates and job function. Am keen to get the current job function and the previous job function (the value that is different from the current job function and also the job function that was held before the current job function). For example, [email protected] is currently swe_mgr2 and had a previous job function of swe_mgr1. Am also keen to capture the duration of the previous job function. The duration is tricky to calculate because of the start dates that are captured in random fashion, but it can be captured based on the first row of the start date of the previous job function till the first row of the start date of the current job function. For example, [email protected] had a previous job function of swe_mgr1 which lasted from 30-08-2018 till 01-06-2019 (ie 10 mths).

dataset

email              startdate       jobfunction
[email protected]     01-01-2018      swe_ic1
[email protected]     01-03-2018      swe_ic2
[email protected]     30-08-2018      swe_mgr1
[email protected]     01-06-2019      swe_mgr2  
[email protected]     01-06-2020      swe_mgr2
[email protected]     30-01-2018      mkt_ic2
[email protected]     01-06-2018      mkt_ic3 
[email protected]     07-09-2018      mkt_mgr1
[email protected]     12-12-2018      mkt_mgr2
[email protected]     15-01-2019      mkt_mgr2 
[email protected]     05-06-2019      mkt_mgr2
[email protected]     01-06-2020      mkt_mgr3
[email protected]   01-06-2019      sales_ic1
[email protected]   01-06-2020      sales_mgr1

The expected output is

email             current_function     previous_function      duration_previous_function
[email protected]    swe_mgr2             swe_mgr1                10mths
[email protected]    mkt_mgr3             mkt_mgr2                18mths
[email protected]  sales_mgr1           sales_ic1               12mths

I am stuck at the first step of trying to get the previous job function,

This code seems to work at retrieving the current job function, but not the previous job function

df2 = df.groupby('email').last().sort_index().reset_index().drop_duplicates()

Am also wondering if this could be achieved using by looping through each email address but the code below does not work as well

emails = df['email']
assigndate = df['startdate']
jobname = df['jobfunction']

for i in emails:
    prevjob = jobname.apply(lambda x: x.unique([-2]))

Appreciate any form of help and assistance, thank you.

Upvotes: 1

Views: 71

Answers (1)

jezrael
jezrael

Reputation: 862851

You can first after converting to datetimes sort columns by DataFrame.sort_values and get last duplicates by DataFrame.drop_duplicates, then create counter by GroupBy.cumcount for filter first 2 rows and later for new level in MultiIndex created by DataFrame.set_index and DataFrame.unstack:

df['startdate'] = pd.to_datetime(df['startdate'], dayfirst=True)

df = (df.sort_values(['email','startdate'], ascending=[True, False])
        .drop_duplicates(['email','jobfunction'], keep='last'))
df['g'] = df.groupby('email').cumcount()
df1 = df[df['g'].lt(2)].copy()
df1 = (df1.set_index(['email','g'])
          .unstack()
          .rename(columns={0:'current',1:'previous'}))
df1.columns = [f'{b}_{a}' for a,b in df1.columns]
df1 = df1.reset_index()

Last subtract columns converted to months periods by DataFrame.pop and Series.dt.to_period:

df1['duration_previous_function'] = (df1.pop('current_startdate')
                                        .dt.to_period('m')
                                        .astype('int')
                                        .sub(df1.pop('previous_startdate')
                                                .dt.to_period('m')
                                                .astype('int')))
print (df1)
              email current_jobfunction previous_jobfunction  \
0    [email protected]            mkt_mgr3             mkt_mgr2   
1    [email protected]            swe_mgr2             swe_mgr1   
2  [email protected]          sales_mgr1            sales_ic1   

   duration_previous_function  
0                          18  
1                          10  
2                          12  

Upvotes: 1

Related Questions