How to iterate through unique ID and calculate difference between date values and write value to a new column

Question

I have a pandas data frame with a unique ID, a stage, and a date associated with that stage. I would like to create a new column showing the time spent in that particular stage. This means subtracting the date for that unique ID minus the date of the next stage associated with that same ID. Since the data is sorted on ID and stage, the last stage of that ID record should error out and read 'current' or n/a. It helps to see the data below. I'm thinking there is a way to do multiple for loops, but am unsure how to do this.

I have tried creating new columns before melting the data frame, but there are many cases in which the next stage is n/a. It is not always stage 1=>2=>3 etc, it can skip directly from stage 1 to stage 3.

   example.describe

   Opportunity_ID  stage      value
   0061R00000l43xP    1.0 2018-11-07
   0061R00000lUT5r    1.0 2019-05-02
   0061R00000lUT5r    2.0 2019-05-22
   0061R00000lUT5r    3.0 2019-06-03
   80061R0000lUT5r    5.0 2019-06-20
   0061R00000lUT5r    5.5 2019-09-10
   0061R00000lXwZL    1.0 2018-12-05
   0061R00000lXwZL    4.0 2019-04-09
   0061R00000lXwZL    5.0 2019-04-19
   0061R00000lXwZL    5.5 2019-04-19
   0061R00000lXwZL    8.0 2019-05-03
   0061R00000lXwZL    9.0 2019-07-09
   0061R00000lXwZL   11.0 2019-08-02
   0061R00000lY4Vm    1.0 2018-12-06
   0061R00000lY4Vm    2.0 2019-09-26
   0061R00000lrOGm    3.0 2019-02-15
   0061R00000lrOGm    4.0 2019-09-18


  [793 rows x 3 columns]>

  example.dtypes
  Opportunity_ID            object
  stage                    float64
  value             datetime64[ns]
  dtype: object

ansev · Accepted Answer

You can use sort_values + groupby to calculate the differences by group based on the ID. To calculate the difference the dates are converted to datetime using pd.to_datetime.:

df['value']=pd.to_datetime(df['value'])
df=df.sort_values(['Opportunity_ID','stage'])
df['difference']=df.groupby('Opportunity_ID')['value'].diff(-1)
print(df)

     Opportunity_ID  stage      value difference
0   0061R00000l43xP    1.0 2018-11-07        NaT
1   0061R00000lUT5r    1.0 2019-05-02   -20 days
2   0061R00000lUT5r    2.0 2019-05-22   -12 days
3   0061R00000lUT5r    3.0 2019-06-03   -99 days
5   0061R00000lUT5r    5.5 2019-09-10        NaT
6   0061R00000lXwZL    1.0 2018-12-05  -125 days
7   0061R00000lXwZL    4.0 2019-04-09   -10 days
8   0061R00000lXwZL    5.0 2019-04-19     0 days
9   0061R00000lXwZL    5.5 2019-04-19   -14 days
10  0061R00000lXwZL    8.0 2019-05-03   -67 days
11  0061R00000lXwZL    9.0 2019-07-09   -24 days
12  0061R00000lXwZL   11.0 2019-08-02        NaT
13  0061R00000lY4Vm    1.0 2018-12-06  -294 days
14  0061R00000lY4Vm    2.0 2019-09-26        NaT
15  0061R00000lrOGm    3.0 2019-02-15  -215 days
16  0061R00000lrOGm    4.0 2019-09-18        NaT
4   80061R0000lUT5r    5.0 2019-06-20        NaT

Ordering may not be convenient for you. you can calculate it without ordering the values previously.This would be the result for your example:

df['value']=pd.to_datetime(df['value'])
df['difference']=df.groupby('Opportunity_ID')['value'].diff(-1)
print(df)

    Opportunity_ID  stage      value difference
0   0061R00000l43xP    1.0 2018-11-07        NaT
1   0061R00000lUT5r    1.0 2019-05-02   -20 days
2   0061R00000lUT5r    2.0 2019-05-22   -12 days
3   0061R00000lUT5r    3.0 2019-06-03   -99 days
4   80061R0000lUT5r    5.0 2019-06-20        NaT
5   0061R00000lUT5r    5.5 2019-09-10        NaT
6   0061R00000lXwZL    1.0 2018-12-05  -125 days
7   0061R00000lXwZL    4.0 2019-04-09   -10 days
8   0061R00000lXwZL    5.0 2019-04-19     0 days
9   0061R00000lXwZL    5.5 2019-04-19   -14 days
10  0061R00000lXwZL    8.0 2019-05-03   -67 days
11  0061R00000lXwZL    9.0 2019-07-09   -24 days
12  0061R00000lXwZL   11.0 2019-08-02        NaT
13  0061R00000lY4Vm    1.0 2018-12-06  -294 days
14  0061R00000lY4Vm    2.0 2019-09-26        NaT
15  0061R00000lrOGm    3.0 2019-02-15  -215 days
16  0061R00000lrOGm    4.0 2019-09-18        NaT

How to iterate through unique ID and calculate difference between date values and write value to a new column

Answers (1)

Related Questions