Apook
Apook

Reputation: 449

Retrieve last row of data with conditions

I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

Race_ID   Date           Student_ID      Rank  
21         1/1/2023       1               3     
21         1/1/2023       2               2     
21         1/1/2023       3               1     
21         1/1/2023       4               4     
25         11/9/2022      1               2     
25         11/9/2022      2               3     
25         11/9/2022      3               1     
3          17/4/2022      5               4     
3          17/4/2022      2               1     
3          17/4/2022      3               2     
3          17/4/2022      4               3     
14         1/3/2022       1               1     
14         1/3/2022       2               2     
85         1/1/2021       1               2     
85         1/1/2021       2               3     
85         1/1/2021       3               1     

And I want to create a new column called Last_win which returns the Race_ID of the last time that student won (i.e. rank number 1). So the outcome should look like

Race_ID   Date           Student_ID      Rank    Last_win
21         1/1/2023       1               3       14
21         1/1/2023       2               2       3
21         1/1/2023       3               1       25
21         1/1/2023       4               4       NaN
25         11/9/2022      1               2       14
25         11/9/2022      2               3       3
25         11/9/2022      3               1       85
3          17/4/2022      5               4       NaN
3          17/4/2022      2               1       NaN
3          17/4/2022      3               2       85
3          17/4/2022      4               3       NaN
14         1/3/2022       1               1       NaN
14         1/3/2022       2               2       NaN
85         1/1/2021       1               2       NaN
85         1/1/2021       2               3       NaN
85         1/1/2021       3               1       NaN

Thank you so much inadvacne.

Upvotes: 1

Views: 68

Answers (2)

Corralien
Corralien

Reputation: 120391

Keep rows where Rank is 1 and hide others then group by Student. Finally, fill backward Race_ID and don't forget to shift the result:

# Convert Date column to DatetimeIndex if needed
# df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df['Last_win'] = (df.where(df['Rank'] == 1)
                    .groupby(df['Student_ID'])['Race_ID']
                    .transform(lambda x: x.bfill().shift(-1)))
print(df)

# Output
    Race_ID       Date  Student_ID  Rank  Last_win
0        21 2023-01-01           1     3      14.0
1        21 2023-01-01           2     2       3.0
2        21 2023-01-01           3     1      25.0
3        21 2023-01-01           4     4       NaN
4        25 2022-09-11           1     2      14.0
5        25 2022-09-11           2     3       3.0
6        25 2022-09-11           3     1      85.0
7         3 2022-04-17           5     4       NaN
8         3 2022-04-17           2     1       NaN
9         3 2022-04-17           3     2      85.0
10        3 2022-04-17           4     3       NaN
11       14 2022-03-01           1     1       NaN
12       14 2022-03-01           2     2       NaN
13       85 2021-01-01           1     2       NaN
14       85 2021-01-01           2     3       NaN
15       85 2021-01-01           3     1       NaN

Upvotes: 2

jezrael
jezrael

Reputation: 862511

Use merge_asof for merge previous datetimes by parameter allow_exact_matches=False:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df = df.sort_values('Date')

df1 = (df.loc[df['Rank'].eq(1), ['Date','Student_ID','Race_ID']]
         .rename(columns={'Race_ID':'Last_win'}))

df = (pd.merge_asof(df.reset_index(), 
                    df1, on='Date', allow_exact_matches=False, by='Student_ID')
        .sort_values('index', ignore_index=True)
        .drop('index', axis=1))
      
print (df)
    Race_ID       Date  Student_ID  Rank  Last_win
0        21 2023-01-01           1     3      14.0
1        21 2023-01-01           2     2       3.0
2        21 2023-01-01           3     1      25.0
3        21 2023-01-01           4     4       NaN
4        25 2022-09-11           1     2      14.0
5        25 2022-09-11           2     3       3.0
6        25 2022-09-11           3     1      85.0
7         3 2022-04-17           5     4       NaN
8         3 2022-04-17           2     1       NaN
9         3 2022-04-17           3     2      85.0
10        3 2022-04-17           4     3       NaN
11       14 2022-03-01           1     1       NaN
12       14 2022-03-01           2     2       NaN
13       85 2021-01-01           1     2       NaN
14       85 2021-01-01           2     3       NaN
15       85 2021-01-01           3     1       NaN

Upvotes: 1

Related Questions