Reputation: 2785
I have a DataFrame which I am grouping by Internal Score and Issue Date (by quarter). I then want to build a statistics table containing the cumulative count of loans (the distinct count of Loan #), the cumulative sum of the loan amounts, and the sums of Actual Loss and Outstanding Principal. The cumulative sum and cumulative count should run from the first date up to each point in time (i.e. Q1 2015 through Q2 2015, then Q1 2015 through Q3 2015, then Q1 2015 through Q4 2015, and so on).
Sample Dataset:
Loan # Amount Issue Date TU Status List Internal Score Last Actual Paid \
0 57144 3337.76 2017-04-03 B A 0.0
1 57145 5536.46 2017-04-03 B C 0.0
2 57160 3443.91 2017-04-03 B B 0.0
3 57161 1162.79 2017-04-03 B B 0.0
4 57162 3845.98 2017-04-03 B B 0.0
5 57163 3441.50 2017-04-03 B B 0.0
6 57164 2039.96 2017-04-03 B C 0.0
7 57165 4427.53 2017-04-03 B A 0.0
8 57166 4427.53 2017-04-03 B A 0.0
9 57167 1617.77 2017-04-03 B B 0.0
Outstanding-Principal Actual Loss
0 3337.76 0.0
1 5536.46 0.0
2 3443.91 0.0
3 1162.79 0.0
4 3845.98 0.0
5 3441.50 0.0
6 2039.96 0.0
7 4427.53 0.0
8 4427.53 0.0
9 1617.77 0.0
I tried something like this:
container = []
for i in ['A', 'B', 'C', 'D']:
    subdf = df[df['Internal Score'].str.contains(i)]
    # Calculate quarterly vintages
    subdf.set_index('Issue Date', inplace=True)
    df2 = subdf.groupby(pd.TimeGrouper('Q')).agg({'Outstanding-Principal': np.sum,
                                                  'Actual Loss': np.sum,
                                                  'Amount': cumsum,
                                                  'Loan #': cumcount})
    df2['Internal Score'] = i
    container.append(df2)

ddf = pd.concat(container)
Upvotes: 3
Views: 2974
Reputation: 6663
You can first use groupby and then apply cumsum afterwards.
I modified your dummy data, changing the dates so they span multiple quarters, to make the example clearer:
print(df)
Loan # Amount Issue Date Internal Score Outstanding Principal Actual Loss
0 57144 3337.76 2017-04-03 A 3337.76 0.0
1 57145 5536.46 2017-04-03 C 5536.46 0.0
2 57160 3443.91 2017-04-03 B 3443.91 0.0
3 57161 1162.79 2017-04-03 B 1162.79 0.0
4 57162 3845.98 2017-04-03 B 3845.98 0.0
5 57163 3441.50 2017-07-03 B 3441.50 0.0
6 57164 2039.96 2017-07-03 C 2039.96 0.0
7 57165 4427.53 2017-07-03 A 4427.53 0.0
8 57166 4427.53 2017-07-03 A 4427.53 0.0
9 57167 1617.77 2017-07-03 B 1617.77 0.0
First, create a column containing a key which identifies the quarter and the year of a given timestamp:
# in case it is not a timestamp already
df["Issue Date"] = pd.to_datetime(df["Issue Date"])
dt = df["Issue Date"].dt
df["Quarter"] = dt.strftime("%Y").str.cat(dt.quarter.astype(str), " Q")
print(df["Quarter"])
0 2017 Q2
1 2017 Q2
2 2017 Q2
3 2017 Q2
4 2017 Q2
5 2017 Q3
6 2017 Q3
7 2017 Q3
8 2017 Q3
9 2017 Q3
Name: Quarter, dtype: object
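As an aside, in reasonably recent pandas versions `Series.dt.to_period` can build the same key more directly; this is a sketch, and the trailing `str.replace` only restores the "2017 Q2" spelling used above (periods print as "2017Q2"):

```python
import pandas as pd

df = pd.DataFrame({"Issue Date": ["2017-04-03", "2017-07-03"]})
df["Issue Date"] = pd.to_datetime(df["Issue Date"])

# Convert each timestamp to its quarterly period, then to a string key
df["Quarter"] = df["Issue Date"].dt.to_period("Q").astype(str).str.replace("Q", " Q")
print(df["Quarter"].tolist())  # ['2017 Q2', '2017 Q3']
```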
Now, aggregate:
funcs = {'Outstanding Principal': np.sum,
'Actual Loss': np.sum,
'Amount': np.sum,
'Loan #': len}
result = df.groupby(['Internal Score', "Quarter"]).agg(funcs)
print(result)
Outstanding Principal Amount Actual Loss Loan #
Internal Score Quarter
A 2017 Q2 3337.76 3337.76 0.0 1
2017 Q3 8855.06 8855.06 0.0 2
B 2017 Q2 8452.68 8452.68 0.0 3
2017 Q3 5059.27 5059.27 0.0 2
C 2017 Q2 5536.46 5536.46 0.0 1
2017 Q3 2039.96 2039.96 0.0 1
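One caveat: the question asks for the *distinct* count of Loan #, while `len` counts rows. If the same loan number can appear on more than one row, `nunique` is the safer aggregator. A small sketch with a deliberately duplicated loan number to show the difference:

```python
import pandas as pd

# Dummy frame: loan 57144 appears twice in the same group
df = pd.DataFrame({
    "Internal Score": ["A", "A", "A"],
    "Quarter": ["2017 Q2", "2017 Q2", "2017 Q2"],
    "Loan #": [57144, 57144, 57165],
})

# size counts rows; nunique counts distinct loan numbers
counts = df.groupby(["Internal Score", "Quarter"])["Loan #"].agg(["size", "nunique"])
print(counts)  # size is 3, nunique is 2
```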
And finally, use transform with cumsum:
cum_cols = ["Amount", "Loan #"]
cumsums = result.groupby(level="Internal Score")[cum_cols].transform(lambda x: x.cumsum())
result.loc[:, cum_cols] = cumsums
print(result)
Outstanding Principal Amount Actual Loss Loan #
Internal Score Quarter
A 2017 Q2 3337.76 3337.76 0.0 1
2017 Q3 8855.06 12192.82 0.0 3
B 2017 Q2 8452.68 8452.68 0.0 3
2017 Q3 5059.27 13511.95 0.0 5
C 2017 Q2 5536.46 5536.46 0.0 1
2017 Q3 2039.96 7576.42 0.0 2
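The two steps above can also be collapsed: once the quarterly aggregate exists, `GroupBy.cumsum` applied within each Internal Score level yields the running totals directly. A minimal self-contained sketch with dummy data (column names taken from the sample above):

```python
import pandas as pd

df = pd.DataFrame({
    "Loan #": [57144, 57145, 57165],
    "Amount": [3337.76, 5536.46, 4427.53],
    "Issue Date": pd.to_datetime(["2017-04-03", "2017-04-03", "2017-07-03"]),
    "Internal Score": ["A", "C", "A"],
    "Outstanding Principal": [3337.76, 5536.46, 4427.53],
    "Actual Loss": [0.0, 0.0, 0.0],
})
df["Quarter"] = df["Issue Date"].dt.to_period("Q").astype(str).str.replace("Q", " Q")

# Per-quarter aggregate, as in the answer above
funcs = {"Outstanding Principal": "sum", "Actual Loss": "sum",
         "Amount": "sum", "Loan #": "size"}
result = df.groupby(["Internal Score", "Quarter"]).agg(funcs)

# Turn Amount and Loan # into running totals within each Internal Score
cum_cols = ["Amount", "Loan #"]
result[cum_cols] = result.groupby(level="Internal Score")[cum_cols].cumsum()
print(result)
```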
Upvotes: 3