Ishigami

Reputation: 541

Creating new column according to the closest last date in Pandas dataframe

I have a pandas dataframe that looks like

import numpy as np
import pandas as pd

data = {
'Date': ['2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-03-14','2024-03-14','2024-03-14','2024-02-14','2024-02-10','2024-02-10','2024-02-10','2024-04-13','2024-04-13','2023-02-11','2023-02-11','2023-02-11','2011-10-11','2011-05-02','2011-05-02'],
'Test_Number': [5,4,3,2,1,3,2,1,4,3,2,1,2,1,3,2,1,1,2,1],
'Student_ID': [2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
'Place': [3,5,7,3,1,9,6,3,7,8,2,1,3,4,2,1,5,6,2,7]
}
df = pd.DataFrame(data)

and I would like to create three new columns 'student_rec_1', 'student_rec_2', 'student_rec_3' using the following method:

for each Student_ID, student_rec_1 is the Place that student achieved in the last test on the closest earlier date, or np.nan if no such test exists.

Similarly, student_rec_2 is the Place of that student in the second-to-last test on the closest earlier date, or np.nan if it does not exist,

and student_rec_3 is the Place of that student in the third-to-last test on the closest earlier date, or np.nan if it does not exist. So the desired outcome looks like

data_new = {
'Date': ['2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-03-14','2024-03-14','2024-03-14','2024-02-14','2024-02-10','2024-02-10','2024-02-10','2024-04-13','2024-04-13','2023-02-11','2023-02-11','2023-02-11','2011-10-11','2011-05-02','2011-05-02'],
'Test_Number': [5,4,3,2,1,3,2,1,4,3,2,1,2,1,3,2,1,1,2,1],
'Student_ID': [2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
'Place': [3,5,7,3,1,9,6,3,7,8,2,1,3,4,2,1,5,6,2,7],
'student_rec_1': [9,9,9,9,9,7,7,7,8,np.nan,np.nan,np.nan,2,2,6,6,6,2,np.nan,np.nan],
'student_rec_2': [6,6,6,6,6,8,8,8,2,np.nan,np.nan,np.nan,1,1,2,2,2,7,np.nan,np.nan],
'student_rec_3': [3,3,3,3,3,2,2,2,1,np.nan,np.nan,np.nan,5,5,7,7,7,np.nan,np.nan,np.nan]
}
df_new = pd.DataFrame(data_new)
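To make the rule concrete, here is a small sketch (using only a slice of Student_ID 2's rows from the data above) that reproduces the records for the first row, dated 2024-07-14:

```python
import pandas as pd

# A few of Student_ID 2's rows, enough to reproduce the first row's records
df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-07-14', '2024-03-14', '2024-03-14',
                            '2024-03-14', '2024-02-14']),
    'Test_Number': [5, 3, 2, 1, 4],
    'Place': [3, 9, 6, 3, 7],
})

# Tests strictly before 2024-07-14, newest date first, highest test number first
as_of = pd.Timestamp('2024-07-14')
prior = (df[df['Date'] < as_of]
           .sort_values(['Date', 'Test_Number'], ascending=[False, False]))

# The first three Places are student_rec_1, student_rec_2, student_rec_3
print(prior['Place'].head(3).tolist())  # [9, 6, 3]
```

which matches the 9, 6, 3 expected for that row in data_new.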

Here is what I have tried:

df['Date'] = pd.to_datetime(df['Date'])

df = df.sort_values(['Date', 'Test_Number'], ascending=[False, False])

def get_last_n_records(group, n):
    return group['Place'].shift(-n)

df['student_rec_1'] = df.groupby('Student_ID').apply(get_last_n_records, 1).reset_index(level=0, drop=True)
df['student_rec_2'] = df.groupby('Student_ID').apply(get_last_n_records, 2).reset_index(level=0, drop=True)
df['student_rec_3'] = df.groupby('Student_ID').apply(get_last_n_records, 3).reset_index(level=0, drop=True)

but this only shifts each student's Place row by row: it doesn't account for the "closest last date" aspect and shifts the Place regardless of the date.
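The failure can be seen on a minimal slice: shift(-1) walks over individual rows, so a same-date test is pulled in where the latest Place from the previous date should be:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-07-14', '2024-07-14', '2024-03-14']),
    'Test_Number': [2, 1, 1],
    'Student_ID': [2, 2, 2],
    'Place': [3, 1, 9],
})
df = df.sort_values(['Date', 'Test_Number'], ascending=[False, False])

# shift(-1) pulls the next *row*, not the previous *date*:
rec_1 = df.groupby('Student_ID')['Place'].shift(-1)
print(rec_1.tolist())  # [1.0, 9.0, nan]
```

The first row gets 1.0 (the other test on the same date) instead of 9.0 (the test on the previous date).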

Upvotes: 2

Views: 78

Answers (1)

jezrael

Reputation: 863351

First convert the Date column with to_datetime and build a helper DataFrame df_cand with renamed columns, so a left self-join back to the original is possible (the original index is kept as orig_index via rename, to avoid losing it in the merge). Then filter the joined rows to strictly earlier datetimes, sort, and create a per-row counter with GroupBy.cumcount to get the 3 most recent values, which are pivoted and merged back to the original df:

df['Date'] = pd.to_datetime(df['Date'])

df = df.reset_index().rename(columns={'index':'orig_index'})

df_cand = (df.rename(columns={'Date':'cand_Date',
                             'Test_Number':'cand_Test_Number',
                             'Place':'cand_Place'})
             .drop(['orig_index'], axis=1))

merged = df.merge(df_cand, on='Student_ID', how='left')

merged = merged[merged['cand_Date'].lt(merged['Date'])]
merged = merged.sort_values(['Student_ID','orig_index','cand_Date','cand_Test_Number'],
                             ascending=[True,True,False,False])

merged['cand_rank'] = merged.groupby('orig_index').cumcount().add(1)

pivot = (merged[merged['cand_rank'].le(3)]
          .pivot(index='orig_index',columns='cand_rank',values='cand_Place')
          .add_prefix('student_rec_'))

out = df.join(pivot).drop('orig_index', axis=1)

print(out)

         Date  Test_Number  Student_ID  Place  student_rec_1  student_rec_2  \
0  2024-07-14            5           2      3            9.0            6.0   
1  2024-07-14            4           2      5            9.0            6.0   
2  2024-07-14            3           2      7            9.0            6.0   
3  2024-07-14            2           2      3            9.0            6.0   
4  2024-07-14            1           2      1            9.0            6.0   
5  2024-03-14            3           2      9            7.0            8.0   
6  2024-03-14            2           2      6            7.0            8.0   
7  2024-03-14            1           2      3            7.0            8.0   
8  2024-02-14            4           2      7            8.0            2.0   
9  2024-02-10            3           2      8            NaN            NaN   
10 2024-02-10            2           2      2            NaN            NaN   
11 2024-02-10            1           2      1            NaN            NaN   
12 2024-04-13            2           1      3            2.0            1.0   
13 2024-04-13            1           1      4            2.0            1.0   
14 2023-02-11            3           1      2            6.0            2.0   
15 2023-02-11            2           1      1            6.0            2.0   
16 2023-02-11            1           1      5            6.0            2.0   
17 2011-10-11            1           1      6            2.0            7.0   
18 2011-05-02            2           1      2            NaN            NaN   
19 2011-05-02            1           1      7            NaN            NaN   

    student_rec_3  
0             3.0  
1             3.0  
2             3.0  
3             3.0  
4             3.0  
5             2.0  
6             2.0  
7             2.0  
8             1.0  
9             NaN  
10            NaN  
11            NaN  
12            5.0  
13            5.0  
14            7.0  
15            7.0  
16            7.0  
17            NaN  
18            NaN  
19            NaN  

EDIT: For better performance a numpy solution working per group is possible - compare each row's date against all dates in the group to build a mask of strictly earlier rows, turn it into an ordering with numpy.cumsum, and pick the N most recent values with numpy.argmax. Because some of those values may not exist, a guard with numpy.any is needed before returning the new columns:

import numpy as np

df['Date'] = pd.to_datetime(df['Date'])

N = 3

def f(x):
    dates = x['Date'].to_numpy()
    places = x['Place'].astype(float).to_numpy()

    # mask[i, j] is True if row j is strictly earlier than row i
    mask = dates < dates[:, None]
    # running count of earlier rows (rows are already ordered newest first)
    cs = np.cumsum(mask, axis=1)
    targets = np.arange(1, N + 1)[None, :]
    cs_ext = cs[..., None]

    # first column where the count reaches 1..N -> index of the n-th previous test
    cond = cs_ext == targets
    first_idx = np.argmax(cond, axis=1)
    m = np.any(cond, axis=1)

    arr = places[first_idx]
    arr[~m] = np.nan  # the n-th previous test does not exist

    return pd.DataFrame(arr,
                        index=x.index,
                        columns=[f'student_rec_{i+1}' for i in range(N)])


out = df.join(df.groupby('Student_ID', group_keys=False)[['Place','Date']].apply(f))

print(out)
         Date  Test_Number  Student_ID  Place  student_rec_1  student_rec_2  \
0  2024-07-14            5           2      3            9.0            6.0   
1  2024-07-14            4           2      5            9.0            6.0   
2  2024-07-14            3           2      7            9.0            6.0   
3  2024-07-14            2           2      3            9.0            6.0   
4  2024-07-14            1           2      1            9.0            6.0   
5  2024-03-14            3           2      9            7.0            8.0   
6  2024-03-14            2           2      6            7.0            8.0   
7  2024-03-14            1           2      3            7.0            8.0   
8  2024-02-14            4           2      7            8.0            2.0   
9  2024-02-10            3           2      8            NaN            NaN   
10 2024-02-10            2           2      2            NaN            NaN   
11 2024-02-10            1           2      1            NaN            NaN   
12 2024-04-13            2           1      3            2.0            1.0   
13 2024-04-13            1           1      4            2.0            1.0   
14 2023-02-11            3           1      2            6.0            2.0   
15 2023-02-11            2           1      1            6.0            2.0   
16 2023-02-11            1           1      5            6.0            2.0   
17 2011-10-11            1           1      6            2.0            7.0   
18 2011-05-02            2           1      2            NaN            NaN   
19 2011-05-02            1           1      7            NaN            NaN   

    student_rec_3  
0             3.0  
1             3.0  
2             3.0  
3             3.0  
4             3.0  
5             2.0  
6             2.0  
7             2.0  
8             1.0  
9             NaN  
10            NaN  
11            NaN  
12            5.0  
13            5.0  
14            7.0  
15            7.0  
16            7.0  
17            NaN  
18            NaN  
19            NaN  
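The core of the numpy version is the mask → cumsum → argmax chain. Isolated on a three-row toy group (hypothetical values, rows already ordered newest first, as the answer assumes), it looks like this:

```python
import numpy as np

# One toy group, rows ordered newest first
dates = np.array(['2024-07-14', '2024-03-14', '2024-02-14'], dtype='datetime64[D]')
places = np.array([3.0, 9.0, 7.0])
N = 2

mask = dates < dates[:, None]                # mask[i, j]: row j is strictly earlier than row i
cs = np.cumsum(mask, axis=1)                 # running count of earlier rows, left to right
cond = cs[..., None] == np.arange(1, N + 1)  # where the count reaches 1..N
first_idx = np.argmax(cond, axis=1)          # first column hitting each target count
found = np.any(cond, axis=1)                 # guard: the n-th previous test may not exist

recs = places[first_idx]
recs[~found] = np.nan
print(recs)  # rows: [9, 7], [7, nan], [nan, nan]
```

The newest row sees both earlier Places (9 and 7), the middle row has only one previous date, and the oldest row has none, so its records are all NaN.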

Upvotes: 3
