Stacey
Stacey

Reputation: 5107

Looking back at previous row in data-frame and selecting specific records

I have a dataframe df which looks like:

           name     year    dept         metric
0   Steve Jones     2018       A    0.703300236
1   Steve Jones     2019       A    0.255587222
2   Jane Smith      2018       A    0.502505934
3   Jane Smith      2019       B    0.698808749
4   Barry Evans     2019       B    0.941325241
5   Tony Edwards    2017       B    0.880940126
6   Tony Edwards    2018       B    0.649086123
7   Tony Edwards    2019       A    0.881365905

I would like to create 2 new data-frame which contains the records where someone has moved from dept A to B and and another where someone has moved from dept B to A. Therefore my desired output is:

            name        year    dept         metric
0     Jane Smith        2018       A    0.502505934
1     Tony Edwards      2019       B    0.649086123

            name        year    dept         metric
0     Jane Smith        2019       B    0.698808749
1   Tony Edwards        2018       B    0.881365905

Where records for the year the last year that someone is in their old dept are captured in one data-frame and the first year in the new dept are captured in another only. The records are sorted by name and year so will be in the correct order.

I've tried :

for row in agg_data.rows:

    df['match'] = np.where(df.dept == 'A' and  df.dept.shift() =='B','1')
    df['match'] = np.where(df.dept == 'B' and  df.dept.shift() =='A','2')

and then select out the records into a data-frame but I get it to work.

Upvotes: 1

Views: 83

Answers (3)

Andy L.
Andy L.

Reputation: 25269

I come up with a solution using drop_duplicates, groupby and rank. Creating df2 on rank=2 and creating df1 on rank==1 and name exists in df2

df['rk'] = df.sort_values(['name', 'dept', 'year']).drop_duplicates(['name', 'dept'], keep='last').groupby('name').year.rank()
df2 = df[df.rk.eq(2)].drop('rk', 1)
df1 = df[df.rk.eq(1) & df.name.isin(df2.name)].drop('rk', 1)


df1:
           name  year dept    metric
2    Jane Smith  2018    A  0.502506
6  Tony Edwards  2018    B  0.649086

df2:
           name  year dept    metric
3    Jane Smith  2019    B  0.698809
7  Tony Edwards  2019    A  0.881366

Upvotes: 0

Serge Ballesta
Serge Ballesta

Reputation: 149125

You could join the initial dataframe with a shift of itself to have convecutive rows on same line. Then you ask the departments you want requiring the names to be the same and you get the indices of one of the expected rows, the other row just has an adjacent index. It gives:

df = agg_data.join(agg_data.shift(), rsuffix='_old')

df1 = df[(df.name_old==df.name)&(df.dept_old=='A')&(df.dept=='B')]
print(pd.concat([agg_data.loc[df1.index], agg_data.loc[df1.index-1]]
                ).sort_index())

df2 = df[(df.name_old==df.name)&(df.dept_old=='B')&(df.dept=='A')]
print(pd.concat([agg_data.loc[df2.index], agg_data.loc[df2.index-1]]
                ).sort_index())

with following output:

         name  year dept    metric
2  Jane Smith  2018    A  0.502506
3  Jane Smith  2019    B  0.698809
           name  year dept    metric
6  Tony Edwards  2018    B  0.649086
7  Tony Edwards  2019    A  0.881366

Upvotes: 0

jezrael
jezrael

Reputation: 863301

I believe you need:

df = df[df.groupby('name')['dept'].transform('nunique') > 1]
df = df.drop_duplicates(['name','dept'], keep='last')

df1 = df.drop_duplicates('name')
print (df1)
           name  year dept    metric
2    Jane Smith  2018    A  0.502506
6  Tony Edwards  2018    B  0.649086

df2 = df.drop_duplicates('name', keep='last')
print (df2)
           name  year dept    metric
3    Jane Smith  2019    B  0.698809
7  Tony Edwards  2019    A  0.881366

Upvotes: 1

Related Questions