Reputation: 1723
I have a dataframe that measure the student performance student
as below:
ID TestDate PerformanceStatus (PS)
1 15/03/2016 0
1 01/04/2016 2
1 05/05/2016 1
1 07/06/2016 1
2 15/03/2016 0
2 01/04/2016 2
2 05/05/2016 1
2 07/06/2016 3
2 23/08/2016 1
I want to update my table to have a new column PreviousPerformanceStatus.
This PreviousPerformanceStatus is calculated based on the performanceStatus monitored, as below:
Note: If there is not performanceStatus recorded before the TestDate, I want to make the PreviousPerformanceStatus = PerformanceStatus
ID TestDate PS PreviousPerformanceStatus
1 15/03/2016 0 0
1 01/04/2016 2 0
1 05/05/2016 1 2
1 07/06/2016 1 1
2 15/03/2016 0 0
2 01/04/2016 2 0
2 05/05/2016 1 2
2 07/06/2016 3 1
2 23/08/2016 1 3
I can do it with an SQL-statement, but how do I update my Dataframe with pandas. Thanks.
Example: (for with ID=1) The previousPerformanceStatus is calculated based on the PerformanceStatus from the "earlier" test date., so when TestDate=01/04/2016, I want to use the data from TestDate=15/03/2016. However, if I can't find any previous data, I will default the PreviousPerformanceStatus with the value in the PerformanceStatus
Upvotes: 0
Views: 79
Reputation: 33793
Perform a groupby
on 'ID'
and use shift
and bfill
:
# Ensure that the dates are pd.Timestamp, and sorted.
df['TestDate'] = pd.to_datetime(df['TestDate'], dayfirst=True)
df = df.sort_values(by=['ID', 'TestDate'])
# Perform a shift and bfill at the 'ID' group level.
df['PrevPS'] = df.groupby('ID')['PS'].apply(lambda grp: grp.shift().bfill()).astype(int)
You can skip the date conversion and sort if you know beforehand that your DataFrame will be sorted correctly.
The resulting output:
ID TestDate PS PrevPS
0 1 2016-03-15 0 0
1 1 2016-04-01 2 0
2 1 2016-05-05 1 2
3 1 2016-06-07 1 1
4 2 2016-03-15 0 0
5 2 2016-04-01 2 0
6 2 2016-05-05 1 2
7 2 2016-06-07 3 1
8 2 2016-08-23 1 3
Upvotes: 2