Reputation: 13
I have a pandas data frame with a unique ID, a stage, and a date associated with that stage. I would like to create a new column showing the time spent in that particular stage. This means subtracting the date for that unique ID minus the date of the next stage associated with that same ID. Since the data is sorted on ID and stage, the last stage of that ID record should error out and read 'current' or n/a. It helps to see the data below. I'm thinking there is a way to do multiple for loops, but am unsure how to do this.
I have tried creating new columns before melting the data frame, but there are many cases in which the next stage is n/a. It is not always stage 1=>2=>3 etc, it can skip directly from stage 1 to stage 3.
example.describe
Opportunity_ID stage value
0061R00000l43xP 1.0 2018-11-07
0061R00000lUT5r 1.0 2019-05-02
0061R00000lUT5r 2.0 2019-05-22
0061R00000lUT5r 3.0 2019-06-03
80061R0000lUT5r 5.0 2019-06-20
0061R00000lUT5r 5.5 2019-09-10
0061R00000lXwZL 1.0 2018-12-05
0061R00000lXwZL 4.0 2019-04-09
0061R00000lXwZL 5.0 2019-04-19
0061R00000lXwZL 5.5 2019-04-19
0061R00000lXwZL 8.0 2019-05-03
0061R00000lXwZL 9.0 2019-07-09
0061R00000lXwZL 11.0 2019-08-02
0061R00000lY4Vm 1.0 2018-12-06
0061R00000lY4Vm 2.0 2019-09-26
0061R00000lrOGm 3.0 2019-02-15
0061R00000lrOGm 4.0 2019-09-18
[793 rows x 3 columns]>
example.dtypes
Opportunity_ID object
stage float64
value datetime64[ns]
dtype: object
Upvotes: 1
Views: 913
Reputation: 30920
You can use sort_values
+ groupby
to calculate the differences by group based on the ID.
To calculate the difference the dates are converted to datetime using pd.to_datetime
.:
df['value']=pd.to_datetime(df['value'])
df=df.sort_values(['Opportunity_ID','stage'])
df['difference']=df.groupby('Opportunity_ID')['value'].diff(-1)
print(df)
Opportunity_ID stage value difference
0 0061R00000l43xP 1.0 2018-11-07 NaT
1 0061R00000lUT5r 1.0 2019-05-02 -20 days
2 0061R00000lUT5r 2.0 2019-05-22 -12 days
3 0061R00000lUT5r 3.0 2019-06-03 -99 days
5 0061R00000lUT5r 5.5 2019-09-10 NaT
6 0061R00000lXwZL 1.0 2018-12-05 -125 days
7 0061R00000lXwZL 4.0 2019-04-09 -10 days
8 0061R00000lXwZL 5.0 2019-04-19 0 days
9 0061R00000lXwZL 5.5 2019-04-19 -14 days
10 0061R00000lXwZL 8.0 2019-05-03 -67 days
11 0061R00000lXwZL 9.0 2019-07-09 -24 days
12 0061R00000lXwZL 11.0 2019-08-02 NaT
13 0061R00000lY4Vm 1.0 2018-12-06 -294 days
14 0061R00000lY4Vm 2.0 2019-09-26 NaT
15 0061R00000lrOGm 3.0 2019-02-15 -215 days
16 0061R00000lrOGm 4.0 2019-09-18 NaT
4 80061R0000lUT5r 5.0 2019-06-20 NaT
Ordering may not be convenient for you. you can calculate it without ordering the values previously.This would be the result for your example:
df['value']=pd.to_datetime(df['value'])
df['difference']=df.groupby('Opportunity_ID')['value'].diff(-1)
print(df)
Opportunity_ID stage value difference
0 0061R00000l43xP 1.0 2018-11-07 NaT
1 0061R00000lUT5r 1.0 2019-05-02 -20 days
2 0061R00000lUT5r 2.0 2019-05-22 -12 days
3 0061R00000lUT5r 3.0 2019-06-03 -99 days
4 80061R0000lUT5r 5.0 2019-06-20 NaT
5 0061R00000lUT5r 5.5 2019-09-10 NaT
6 0061R00000lXwZL 1.0 2018-12-05 -125 days
7 0061R00000lXwZL 4.0 2019-04-09 -10 days
8 0061R00000lXwZL 5.0 2019-04-19 0 days
9 0061R00000lXwZL 5.5 2019-04-19 -14 days
10 0061R00000lXwZL 8.0 2019-05-03 -67 days
11 0061R00000lXwZL 9.0 2019-07-09 -24 days
12 0061R00000lXwZL 11.0 2019-08-02 NaT
13 0061R00000lY4Vm 1.0 2018-12-06 -294 days
14 0061R00000lY4Vm 2.0 2019-09-26 NaT
15 0061R00000lrOGm 3.0 2019-02-15 -215 days
16 0061R00000lrOGm 4.0 2019-09-18 NaT
Upvotes: 1