j1897
j1897

Reputation: 1557

How to find difference between dates per each id in python?

I have a pandas dataframe with a format like this:

student_id     subject_id   subject_date  
100             2000        2010-01-01
100             2001        2010-03-05
100             2002        2012-05-25
101             2000        2009-01-10
101             2001        2016-08-16
102             2000        2008-05-05
102             2003        2008-05-20
102             2004        2009-01-03
102             2005        2010-02-13

The dataframe is already ordered by student_id and subject_date. The goal is to get the difference between subject_date for each student_id. For each student_id, it is guaranteed that there will be a minimum of 2 distinct subject_id. The resulting dataframe will look something like this:

student_id     subject_id   subject_date  diff_in_dates  
100             2000        2010-01-01    NA
100             2001        2010-03-05    30
100             2002        2012-05-25    60
101             2000        2009-01-10    NA
101             2001        2016-08-16    3000
102             2000        2008-05-05    NA
102             2003        2008-05-20    15
102             2004        2009-01-03    180
102             2005        2010-02-13    370

diff_in_dates values are just an approximation here and not the actual difference.

Upvotes: 0

Views: 593

Answers (2)

MattR
MattR

Reputation: 5126

this is simple in practice! Check out dif():

df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff()

   student_id  subject_id subject_date diff_in_date
0         100        2000   2010-01-01          NaT
1         100        2001   2010-03-05      63 days
2         100        2002   2012-05-25     812 days
3         101        2000   2009-01-10          NaT
4         101        2001   2016-08-16    2775 days
5         102        2000   2008-05-05          NaT
6         102        2003   2008-05-20      15 days
7         102        2004   2009-01-03     228 days
8         102        2005   2010-02-13     406 days

dif() is just doing a difference from the current record and the previous record along the column. It's great that your data was already sorted correctly!

for future use dif() could also take in a parameter to add multiple row differences into the result. so see the an example output below:

df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff(2)

# output
   student_id  subject_id subject_date diff_in_date
0         100        2000   2010-01-01          NaT
1         100        2001   2010-03-05          NaT
2         100        2002   2012-05-25     875 days
3         101        2000   2009-01-10          NaT
4         101        2001   2016-08-16          NaT
5         102        2000   2008-05-05          NaT
6         102        2003   2008-05-20          NaT
7         102        2004   2009-01-03     243 days
8         102        2005   2010-02-13     634 days 

Upvotes: 4

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

IIUC:

In [362]: df['diff_in_dates '] = df.groupby('student_id')['subject_date'].diff().dt.days

In [363]: df
Out[363]:
   student_id  subject_id subject_date  diff_in_dates
0         100        2000   2010-01-01             NaN
1         100        2001   2010-03-05            63.0
2         100        2002   2012-05-25           812.0
3         101        2000   2009-01-10             NaN
4         101        2001   2016-08-16          2775.0
5         102        2000   2008-05-05             NaN
6         102        2003   2008-05-20            15.0
7         102        2004   2009-01-03           228.0
8         102        2005   2010-02-13           406.0

Upvotes: 6

Related Questions