Reputation: 1557
I have a pandas dataframe with a format like this:
student_id subject_id subject_date
100 2000 2010-01-01
100 2001 2010-03-05
100 2002 2012-05-25
101 2000 2009-01-10
101 2001 2016-08-16
102 2000 2008-05-05
102 2003 2008-05-20
102 2004 2009-01-03
102 2005 2010-02-13
The dataframe is already ordered by student_id
and subject_date
.
The goal is to get the difference between subject_date
for each student_id
. For each student_id
, it is guaranteed that there will be a minimum of 2 distinct subject_id
. The resulting dataframe will look something like this:
student_id subject_id subject_date diff_in_dates
100 2000 2010-01-01 NA
100 2001 2010-03-05 30
100 2002 2012-05-25 60
101 2000 2009-01-10 NA
101 2001 2016-08-16 3000
102 2000 2008-05-05 NA
102 2003 2008-05-20 15
102 2004 2009-01-03 180
102 2005 2010-02-13 370
diff_in_dates
values are just an approximation here and not the actual difference.
Upvotes: 0
Views: 593
Reputation: 5126
this is simple in practice! Check out dif()
:
df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff()
student_id subject_id subject_date diff_in_date
0 100 2000 2010-01-01 NaT
1 100 2001 2010-03-05 63 days
2 100 2002 2012-05-25 812 days
3 101 2000 2009-01-10 NaT
4 101 2001 2016-08-16 2775 days
5 102 2000 2008-05-05 NaT
6 102 2003 2008-05-20 15 days
7 102 2004 2009-01-03 228 days
8 102 2005 2010-02-13 406 days
dif()
is just doing a difference from the current record and the previous record along the column. It's great that your data was already sorted correctly!
for future use dif()
could also take in a parameter to add multiple row differences into the result. so see the an example output below:
df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff(2)
# output
student_id subject_id subject_date diff_in_date
0 100 2000 2010-01-01 NaT
1 100 2001 2010-03-05 NaT
2 100 2002 2012-05-25 875 days
3 101 2000 2009-01-10 NaT
4 101 2001 2016-08-16 NaT
5 102 2000 2008-05-05 NaT
6 102 2003 2008-05-20 NaT
7 102 2004 2009-01-03 243 days
8 102 2005 2010-02-13 634 days
Upvotes: 4
Reputation: 210832
IIUC:
In [362]: df['diff_in_dates '] = df.groupby('student_id')['subject_date'].diff().dt.days
In [363]: df
Out[363]:
student_id subject_id subject_date diff_in_dates
0 100 2000 2010-01-01 NaN
1 100 2001 2010-03-05 63.0
2 100 2002 2012-05-25 812.0
3 101 2000 2009-01-10 NaN
4 101 2001 2016-08-16 2775.0
5 102 2000 2008-05-05 NaN
6 102 2003 2008-05-20 15.0
7 102 2004 2009-01-03 228.0
8 102 2005 2010-02-13 406.0
Upvotes: 6