Reputation: 5107
I have a dataframe df
which looks like:
name year dept metric
0 Steve Jones 2018 A 0.703300236
1 Steve Jones 2019 A 0.255587222
2 Jane Smith 2018 A 0.502505934
3 Jane Smith 2019 B 0.698808749
4 Barry Evans 2019 B 0.941325241
5 Tony Edwards 2017 B 0.880940126
6 Tony Edwards 2018 B 0.649086123
7 Tony Edwards 2019 A 0.881365905
I would like to create 2 new data-frame which contains the records where someone has moved from dept A to B and and another where someone has moved from dept B to A. Therefore my desired output is:
name year dept metric
0 Jane Smith 2018 A 0.502505934
1 Tony Edwards 2019 B 0.649086123
name year dept metric
0 Jane Smith 2019 B 0.698808749
1 Tony Edwards 2018 B 0.881365905
Where records for the year the last year that someone is in their old dept are captured in one data-frame and the first year in the new dept are captured in another only. The records are sorted by name and year so will be in the correct order.
I've tried :
for row in agg_data.rows:
df['match'] = np.where(df.dept == 'A' and df.dept.shift() =='B','1')
df['match'] = np.where(df.dept == 'B' and df.dept.shift() =='A','2')
and then select out the records into a data-frame but I get it to work.
Upvotes: 1
Views: 83
Reputation: 25269
I come up with a solution using drop_duplicates
, groupby
and rank
. Creating df2
on rank=2
and creating df1
on rank==1
and name
exists in df2
df['rk'] = df.sort_values(['name', 'dept', 'year']).drop_duplicates(['name', 'dept'], keep='last').groupby('name').year.rank()
df2 = df[df.rk.eq(2)].drop('rk', 1)
df1 = df[df.rk.eq(1) & df.name.isin(df2.name)].drop('rk', 1)
df1:
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2:
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366
Upvotes: 0
Reputation: 149125
You could join the initial dataframe with a shift of itself to have convecutive rows on same line. Then you ask the departments you want requiring the names to be the same and you get the indices of one of the expected rows, the other row just has an adjacent index. It gives:
df = agg_data.join(agg_data.shift(), rsuffix='_old')
df1 = df[(df.name_old==df.name)&(df.dept_old=='A')&(df.dept=='B')]
print(pd.concat([agg_data.loc[df1.index], agg_data.loc[df1.index-1]]
).sort_index())
df2 = df[(df.name_old==df.name)&(df.dept_old=='B')&(df.dept=='A')]
print(pd.concat([agg_data.loc[df2.index], agg_data.loc[df2.index-1]]
).sort_index())
with following output:
name year dept metric
2 Jane Smith 2018 A 0.502506
3 Jane Smith 2019 B 0.698809
name year dept metric
6 Tony Edwards 2018 B 0.649086
7 Tony Edwards 2019 A 0.881366
Upvotes: 0
Reputation: 863301
I believe you need:
df = df[df.groupby('name')['dept'].transform('nunique') > 1]
df = df.drop_duplicates(['name','dept'], keep='last')
df1 = df.drop_duplicates('name')
print (df1)
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2 = df.drop_duplicates('name', keep='last')
print (df2)
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366
Upvotes: 1