Subtract one column by all values in another column only if condition is met

Question

I have a dataframe that is a time series and I want to get the cumsum of differences between the CLOSE and SUBMISSION of an issue. However, I want it only to subtract if the CLOSE value is higher than the SUBMISSION value. Here are the data points (sorted by CLOSE), the expected output, and my attempted code:

import pandas as pd

df = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27']})
df['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')

For REF_KEY == 1, ACCUM_DATE_DELTA should be the sum of:

15 day difference between ('2018-09-05' - '2018-08-21')
2 day difference between ('2018-09-05' - '2018-09-03')
8 day difference between ('2018-09-05' - '2018-08-28') making it 25

For REF_KEY == 2, you will get the sum of:

22 day difference between ('2018-09-12' - '2018-08-21')
9 day difference between ('2018-09-12' - '2018-09-03')
5 day difference between ('2018-09-12' - '2018-09-07')
6 day difference between ('2018-09-12' - '2018-09-06')
15 day difference between ('2018-09-12' - '2018-08-28')

So for REF_KEY == 2, you can see that the sum for its close date includes the submissions of REF_KEY == [3, 4], because its CLOSE is later than their SUBMISSION dates, whereas for REF_KEY == 1 they are excluded. Therefore, I had the idea of adding a condition that the CLOSE date has to be later than the SUBMISSION date.
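
To make the condition concrete, here is a minimal sketch that computes the total for a single CLOSE date by masking out the SUBMISSION dates that are not earlier. The helper name `accum_for_close` is hypothetical and only used for illustration; it is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'REF_KEY': [1, 2, 3, 4, 5],
    'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'],
    'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'],
})
df['SUBMISSION'] = pd.to_datetime(df['SUBMISSION'])
df['CLOSE'] = pd.to_datetime(df['CLOSE'])

def accum_for_close(close, submissions):
    """Sum (close - submission) in days over submissions strictly earlier than close.

    Hypothetical helper for illustration only.
    """
    earlier = submissions[submissions < close]   # keep only SUBMISSION < CLOSE
    return int((close - earlier).dt.days.sum())  # total of the day differences

first = accum_for_close(df.loc[0, 'CLOSE'], df['SUBMISSION'])   # REF_KEY == 1
second = accum_for_close(df.loc[1, 'CLOSE'], df['SUBMISSION'])  # REF_KEY == 2
print(first, second)  # 25 57
```

Calling this once per row would work but scales quadratically in Python-level loops, which is why a vectorized approach is preferable for larger frames.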

df_2 = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 
                     'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'], 'ACCUM_DATE_DELTA': [25, 57, 86, 116, 131]})
df_2['CLOSE'] = df_2['CLOSE'].astype('datetime64[ns]')
df_2['SUBMISSION'] = df_2['SUBMISSION'].astype('datetime64[ns]')

Attempted code:

df_2['ACCUM_DATE_DELTA'] = df_2['CLOSE']*len(df_2[df_2['CLOSE'] - df_2['SUBMISSION']]['SUBMISSION'].cumsum()) - df_2[df_2['CLOSE'] - df_2['SUBMISSION']]['SUBMISSION'].cumsum()

tdy · Accepted Answer

Cross-merge the two columns to generate the Cartesian product of SUBMISSION x CLOSE
Keep only the rows where CLOSE > SUBMISSION
Group by the CLOSE dates and sum each group's CLOSE - SUBMISSION days
Merge the ACCUM values back to the original df

m = pd.merge(df.SUBMISSION, df.CLOSE, how='cross') # cross-merge for all SUBMISSION x CLOSE combos

accum = (m.where(m.CLOSE > m.SUBMISSION)           # limit to CLOSE > SUBMISSION
          .groupby('CLOSE').SUBMISSION             # group by CLOSE
          .apply(lambda g: (g.name - g).sum())     # sum of all (CLOSE - SUBMISSION)
          .rename('ACCUM'))

df.merge(accum, on='CLOSE')                        # merge back to df

Output:

   REF_KEY  SUBMISSION       CLOSE     ACCUM
0        1  2018-08-21  2018-09-05   25 days
1        2  2018-09-03  2018-09-12   57 days
2        3  2018-09-07  2018-09-18   87 days
3        4  2018-09-06  2018-09-24  117 days
4        5  2018-08-28  2018-09-27  132 days
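
If you would rather avoid the intermediate merged frame, the same pairwise comparison can be sketched with NumPy broadcasting. This is an alternative technique, not the merge/groupby method above, and the column name `ACCUM_DAYS` is just an illustrative choice. It assumes all dates are at midnight, so the day counts are whole numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'REF_KEY': [1, 2, 3, 4, 5],
    'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'],
    'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'],
})
df['SUBMISSION'] = pd.to_datetime(df['SUBMISSION'])
df['CLOSE'] = pd.to_datetime(df['CLOSE'])

sub = df['SUBMISSION'].to_numpy()
close = df['CLOSE'].to_numpy()

# days[i, j] = CLOSE[i] - SUBMISSION[j] in days, for every pair of rows
days = (close[:, None] - sub[None, :]) / np.timedelta64(1, 'D')

# Zero out pairs where CLOSE is not later than SUBMISSION, then sum per CLOSE
df['ACCUM_DAYS'] = np.where(days > 0, days, 0).sum(axis=1).astype(int)

print(df['ACCUM_DAYS'].tolist())  # [25, 57, 87, 117, 132]
```

Both approaches build an n x n intermediate, so memory grows quadratically with the number of rows either way.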

Notes:

how='cross' requires pandas 1.2.0+, so for earlier versions, merge on a dummy key column:

m = df[['SUBMISSION']].assign(key=0).merge(df[['CLOSE']].assign(key=0), on='key').drop(columns='key')
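
As a quick sanity check of the dummy-key fallback, this sketch rebuilds the sample data and confirms that the result has one row per SUBMISSION x CLOSE pair. The variable names `left`, `right`, `n_rows` and `n_unique_pairs` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'REF_KEY': [1, 2, 3, 4, 5],
    'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'],
    'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'],
})
df['SUBMISSION'] = pd.to_datetime(df['SUBMISSION'])
df['CLOSE'] = pd.to_datetime(df['CLOSE'])

# Every row gets the same key, so an ordinary merge pairs every row with every row
left = df[['SUBMISSION']].assign(key=0)
right = df[['CLOSE']].assign(key=0)
m = left.merge(right, on='key').drop(columns='key')

n_rows = len(m)                          # 5 x 5 combinations
n_unique_pairs = len(m.drop_duplicates())
print(n_rows, n_unique_pairs)            # 25 25
```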

As with Jonathan's solution, three of these totals (REF_KEY 3, 4 and 5) are one day higher than your expected output.
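
To see exactly where the computed totals and the expected ACCUM_DATE_DELTA values disagree, you could line them up side by side. This is a sketch that uses a boolean filter and a vectorized subtraction rather than the exact where/apply chain above; the `expected` list is copied from the question, and `comparison` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'REF_KEY': [1, 2, 3, 4, 5],
    'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'],
    'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'],
})
df['SUBMISSION'] = pd.to_datetime(df['SUBMISSION'])
df['CLOSE'] = pd.to_datetime(df['CLOSE'])

expected = [25, 57, 86, 116, 131]  # ACCUM_DATE_DELTA values from the question

# All SUBMISSION x CLOSE pairs, limited to CLOSE > SUBMISSION (needs pandas 1.2.0+)
m = df[['SUBMISSION']].merge(df[['CLOSE']], how='cross')
m = m[m.CLOSE > m.SUBMISSION]

# Total day difference per CLOSE date, then looked up for each original row
accum = (m.CLOSE - m.SUBMISSION).dt.days.groupby(m.CLOSE).sum()
computed = df['CLOSE'].map(accum)

comparison = pd.DataFrame({
    'REF_KEY': df['REF_KEY'],
    'expected': expected,
    'computed': computed,
    'diff': computed - expected,
})
print(comparison['diff'].tolist())  # [0, 0, 1, 1, 1]
```

Summing the individual day differences by hand for REF_KEY == 3 (28 + 15 + 11 + 12 + 21) also gives 87, so the discrepancy appears to be in the expected values rather than in the computation.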
