Riley Hun
Riley Hun

Reputation: 2785

Python/Pandas: How to combine cumsum and cumcount with agg function?

I have a DataFrame where I am grouping by Internal Score and Issue Date (by Quarter). I then want to create a statistical table that includes the cumulative counts of the number of loans (represented by the distinct count of Loan #), the cumulative sum of the loan amounts, and the sums of Actual Loss and Outstanding Principal. The cumulative sum and cumulative count should include a snapshot of the first date up until that particular point in time. (i.e. cumulative sum of Q1 2015 to Q2 2015, then from Q1 2015 to Q3 2015, then from Q1 2015 to Q4 2015, etc.)

Sample Dataset:

   Loan #   Amount Issue Date TU Status List Internal Score  Last Actual Paid  \
0   57144  3337.76 2017-04-03              B              A               0.0   
1   57145  5536.46 2017-04-03              B              C               0.0   
2   57160  3443.91 2017-04-03              B              B               0.0   
3   57161  1162.79 2017-04-03              B              B               0.0   
4   57162  3845.98 2017-04-03              B              B               0.0   
5   57163  3441.50 2017-04-03              B              B               0.0   
6   57164  2039.96 2017-04-03              B              C               0.0   
7   57165  4427.53 2017-04-03              B              A               0.0   
8   57166  4427.53 2017-04-03              B              A               0.0   
9   57167  1617.77 2017-04-03              B              B               0.0   

   Outstanding-Principal  Actual Loss  
0                3337.76          0.0  
1                5536.46          0.0  
2                3443.91          0.0  
3                1162.79          0.0  
4                3845.98          0.0  
5                3441.50          0.0  
6                2039.96          0.0  
7                4427.53          0.0  
8                4427.53          0.0  
9                1617.77          0.0

I tried something like this:

container = []
for i in ['A', 'B', 'C', 'D']:

    subdf = df[df['Internal Score'].str.contains(i)]

    # Calculate Quarterly Vintages
    subdf.set_index('Issue Date', inplace=True)
    df2 = subdf.groupby(pd.TimeGrouper('Q')).agg({'Outstanding-Principal': np.sum, 'Actual Loss': np.sum,
                                                  'Amount': cumsum, 'Loan #': cumcount})
    df2['Internal Score'] = i
    container.append(df2)

ddf = pd.concat(container)

Upvotes: 3

Views: 2974

Answers (1)

pansen
pansen

Reputation: 6663

You can first use groupby and apply the cumsum afterwards.

I modified your dummy data while changing the dates to span across quarters to make your example more clear:

print(df)

    Loan #  Amount      Issue Date  Internal Score  Outstanding Principal   Actual Loss
0   57144   3337.76     2017-04-03  A               3337.76                 0.0
1   57145   5536.46     2017-04-03  C               5536.46                 0.0
2   57160   3443.91     2017-04-03  B               3443.91                 0.0
3   57161   1162.79     2017-04-03  B               1162.79                 0.0
4   57162   3845.98     2017-04-03  B               3845.98                 0.0
5   57163   3441.50     2017-07-03  B               3441.50                 0.0
6   57164   2039.96     2017-07-03  C               2039.96                 0.0
7   57165   4427.53     2017-07-03  A               4427.53                 0.0
8   57166   4427.53     2017-07-03  A               4427.53                 0.0
9   57167   1617.77     2017-07-03  B               1617.77                 0.0

First, create a column containing a key which identifies the quarter and the year of a given timestamp:

# in case it is not a timestamp already
df["Issue Date"] = pd.to_datetime(df["Issue Date"])

dt = df["Issue Date"].dt
df["Quarter"] = dt.strftime("%Y").str.cat(dt.quarter.astype(str), " Q")

print(df["Quarter"])

0    2017 Q2
1    2017 Q2
2    2017 Q2
3    2017 Q2
4    2017 Q2
5    2017 Q3
6    2017 Q3
7    2017 Q3
8    2017 Q3
9    2017 Q3
Name: Quarter, dtype: object

Now, aggreagte:

funcs = {'Outstanding Principal': np.sum, 
         'Actual Loss': np.sum, 
         'Amount': np.sum, 
         'Loan #': len}

result = df.groupby(['Internal Score', "Quarter"]).agg(funcs)
print(result)

                            Outstanding Principal   Amount      Actual Loss     Loan #
Internal Score  Quarter                 
             A  2017 Q2     3337.76                 3337.76     0.0             1
                2017 Q3     8855.06                 8855.06     0.0             2
             B  2017 Q2     8452.68                 8452.68     0.0             3
                2017 Q3     5059.27                 5059.27     0.0             2
             C  2017 Q2     5536.46                 5536.46     0.0             1
                2017 Q3     2039.96                 2039.96     0.0             1

And finally use transform with cumsum:

cum_cols = ["Amount", "Loan #"]
cumsums = result.groupby(level="Internal Score")[cum_cols].transform(lambda x: x.cumsum())
result.loc[:, cum_cols] = cumsums

print(result)
                            Outstanding Principal   Amount      Actual Loss     Loan #
Internal Score  Quarter                 
             A  2017 Q2     3337.76                 3337.76     0.0             1
                2017 Q3     8855.06                12192.82     0.0             3
             B  2017 Q2     8452.68                 8452.68     0.0             3
                2017 Q3     5059.27                13511.95     0.0             5
             C  2017 Q2     5536.46                 5536.46     0.0             1
                2017 Q3     2039.96                 7576.42     0.0             2

Upvotes: 3

Related Questions