Reputation: 151
I have a dataframe that is a time series and I want to get the cumsum of differences between the CLOSE
and SUBMISSION
of an issue. However, I want it only to subtract if the CLOSE
value is higher than the SUBMISSION
value. Here are the data points (sorted by CLOSE
), the expected output, and my attempted code:
df = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27']})
df['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')
For REF_KEY == 1
, ACCUM_DATE_DELTA
should be the sum of:
For REF_KEY == 2
, you will get the sum of:
So for REF_KEY == 1
, you can see that the difference between its close date includes REF_KEY == [3, 4]
, and that is because the CLOSE
is greater than SUBMISSION
. Therefore, I had the idea of creating a condition where the CLOSE
date has to be more than SUBMISSION
date.
df_2 = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5],
'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'], 'ACCUM_DATE_DELTA': [25, 57, 86, 116, 131]})
df_2['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df_2['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')
Attempted code:
df_2['ACCUM_DATE_DELTA'] = df_2['CLOSE']*len(df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()) - df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()
Upvotes: 1
Views: 669
Reputation: 41327
merge
to generate their cartesian product of SUBMISSION
x CLOSE
where
CLOSE > SUBMISSION
groupby
the CLOSE
dates and sum the group's CLOSE - SUBMISSION
daysmerge
the ACCUM
values back to the original dfm = pd.merge(df.SUBMISSION, df.CLOSE, how='cross') # cross-merge for all SUBMISSION x CLOSE combos
accum = (m.where(m.CLOSE > m.SUBMISSION) # limit to CLOSE > SUBMISSION
.groupby('CLOSE').SUBMISSION # group by CLOSE
.apply(lambda g: (g.name - g).sum()) # sum of all (CLOSE - SUBMISSION)
.rename('ACCUM'))
df.merge(accum, on='CLOSE') # merge back to df
Output:
REF_KEY SUBMISSION CLOSE ACCUM
0 1 2018-08-21 2018-09-05 25 days
1 2 2018-09-03 2018-09-12 57 days
2 3 2018-09-07 2018-09-18 87 days
3 4 2018-09-06 2018-09-24 117 days
4 5 2018-08-28 2018-09-27 132 days
Notes:
how='cross'
requires pandas 1.2.0+, so for earlier versions, merge
on a dummy key
column:
m = df[['SUBMISSION']].assign(key=0).merge(df[['CLOSE']].assign(key=0), on='key').drop(columns='key')
As with Jonathan's solution, a couple of these days are off by 1 compared to your output.
Upvotes: 1
Reputation: 5648
I'm not whose calculations are correct. If mine are, let me know if this is what you want.
For 'REF_KEY' == 1, you can use this to find the cumulative sum of days
(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
df['CLOSE'][0]
is first close date to calculate against all submission dates; dt.days
gives days in integer
(df['CLOSE'][0] - df['SUBMISSION']).dt.days
0 15
1 2
2 -2
3 -1
4 8
Name: SUBMISSION, dtype: int64
Use clip(lower=0).sum()
to change negative values to zero and sum
(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
result = 25
To automate this, use apply()
with a custom function
def calc(x):
# print((x - df['SUBMISSION']).dt.days.clip(lower=0).sum())
return (x - df['SUBMISSION']).dt.days.clip(lower=0).sum()
df
REF_KEY SUBMISSION CLOSE
0 1 2018-08-21 2018-09-05
1 2 2018-09-03 2018-09-12
2 3 2018-09-07 2018-09-18
3 4 2018-09-06 2018-09-24
4 5 2018-08-28 2018-09-27
df['ACCUM_DATE_DELTA'] = df.apply(lambda x: calc(x['CLOSE']), axis=1)
REF_KEY SUBMISSION CLOSE ACCUM_DATE_DELTA
0 1 2018-08-21 2018-09-05 25
1 2 2018-09-03 2018-09-12 57
2 3 2018-09-07 2018-09-18 87
3 4 2018-09-06 2018-09-24 117
4 5 2018-08-28 2018-09-27 132
Upvotes: 0