Reputation: 445
I have a dataset with email addresses, dates and job function. Am keen to get the current job function and the previous job function (the value that is different from the current job function and also the job function that was held before the current job function). For example, [email protected] is currently swe_mgr2 and had a previous job function of swe_mgr1. Am also keen to capture the duration of the previous job function. The duration is tricky to calculate because of the start dates that are captured in random fashion, but it can be captured based on the first row of the start date of the previous job function till the first row of the start date of the current job function. For example, [email protected] had a previous job function of swe_mgr1 which lasted from 30-08-2018 till 01-06-2019 (ie 10 mths).
dataset
email startdate jobfunction
[email protected] 01-01-2018 swe_ic1
[email protected] 01-03-2018 swe_ic2
[email protected] 30-08-2018 swe_mgr1
[email protected] 01-06-2019 swe_mgr2
[email protected] 01-06-2020 swe_mgr2
[email protected] 30-01-2018 mkt_ic2
[email protected] 01-06-2018 mkt_ic3
[email protected] 07-09-2018 mkt_mgr1
[email protected] 12-12-2018 mkt_mgr2
[email protected] 15-01-2019 mkt_mgr2
[email protected] 05-06-2019 mkt_mgr2
[email protected] 01-06-2020 mkt_mgr3
[email protected] 01-06-2019 sales_ic1
[email protected] 01-06-2020 sales_mgr1
The expected output is
email current_function previous_function duration_previous_function
[email protected] swe_mgr2 swe_mgr1 10mths
[email protected] mkt_mgr3 mkt_mgr2 18mths
[email protected] sales_mgr1 sales_ic1 12mths
I am stuck at the first step of trying to get the previous job function,
This code seems to work at retrieving the current job function, but not the previous job function
df2 = df.groupby('email').last().sort_index().reset_index().drop_duplicates()
Am also wondering if this could be achieved using by looping through each email address but the code below does not work as well
emails = df['email']
assigndate = df['startdate']
jobname = df['jobfunction']
for i in emails:
prevjob = jobname.apply(lambda x: x.unique([-2]))
Appreciate any form of help and assistance, thank you.
Upvotes: 1
Views: 71
Reputation: 862851
You can first after converting to datetimes sort columns by DataFrame.sort_values
and get last duplicates by DataFrame.drop_duplicates
, then create counter by GroupBy.cumcount
for filter first 2 rows and later for new level in MultiIndex
created by DataFrame.set_index
and DataFrame.unstack
:
df['startdate'] = pd.to_datetime(df['startdate'], dayfirst=True)
df = (df.sort_values(['email','startdate'], ascending=[True, False])
.drop_duplicates(['email','jobfunction'], keep='last'))
df['g'] = df.groupby('email').cumcount()
df1 = df[df['g'].lt(2)].copy()
df1 = (df1.set_index(['email','g'])
.unstack()
.rename(columns={0:'current',1:'previous'}))
df1.columns = [f'{b}_{a}' for a,b in df1.columns]
df1 = df1.reset_index()
Last subtract columns converted to months periods by DataFrame.pop
and Series.dt.to_period
:
df1['duration_previous_function'] = (df1.pop('current_startdate')
.dt.to_period('m')
.astype('int')
.sub(df1.pop('previous_startdate')
.dt.to_period('m')
.astype('int')))
print (df1)
email current_jobfunction previous_jobfunction \
0 [email protected] mkt_mgr3 mkt_mgr2
1 [email protected] swe_mgr2 swe_mgr1
2 [email protected] sales_mgr1 sales_ic1
duration_previous_function
0 18
1 10
2 12
Upvotes: 1