Reputation: 2495
I have a historical collection of ~500k loans, some of which have defaulted and others have not. My dataframe is `lcd_temp`. `lcd_temp` has information on the loan size (`loan_amnt`), whether the loan has defaulted or not (`Total_Defaults`), the annual loan rate (`clean_rate`), the term of the loan (`clean_term`), and the months from origination to default (`mos_to_default`). `mos_to_default` is equal to `clean_term` if there is no default.
I would like to calculate the Cumulative Cashflow (`cum_cf`) for each loan as the sum of all coupons paid until default plus `(1 - severity) * loan_amnt` if the loan defaults, and simply the principal plus the coupons if it pays back on time.
Here's my code, which takes an awfully long time to run:
severity = 1
for i in range(len(lcd_temp)):
    if lcd_temp.loc[i, 'Total_Defaults'] == 1:
        # Default: pay coupon only until time of default, plus (1 - severity) of principal
        lcd_temp.loc[i, 'cum_cf'] = ((lcd_temp.loc[i, 'mos_to_default'] / 12)
                                     * lcd_temp.loc[i, 'clean_rate']
                                     + (1 - severity) * lcd_temp.loc[i, 'loan_amnt'])
    else:
        # Total cf is sum of coupons (non-compounded) + principal
        lcd_temp.loc[i, 'cum_cf'] = ((1 + lcd_temp.loc[i, 'clean_term'] / 12
                                      * lcd_temp.loc[i, 'clean_rate'])
                                     * lcd_temp.loc[i, 'loan_amnt'])
Any thoughts or suggestions on improving the speed (it has taken over an hour so far) are welcome!
Upvotes: 2
Views: 111
Reputation: 879799
Assuming you are using Pandas/NumPy, the standard way to replace an if-then construction such as the one you are using is `np.where(mask, A, B)`. The `mask` is an array of boolean values. Where it is True, the corresponding value from `A` is returned; where it is False, the corresponding value from `B` is returned. The result is an array of the same shape as `mask`, with values drawn from `A` and/or `B`.
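A minimal toy example (unrelated to the loan data) shows the selection behavior:

```python
import numpy as np

# Where mask is True, take the value from A; where False, take it from B.
mask = np.array([True, False, True, False])
A = np.array([10, 20, 30, 40])
B = np.array([1, 2, 3, 4])
result = np.where(mask, A, B)
print(result)  # [10  2 30  4]
```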
import numpy as np

severity = 1
mask = (lcd_temp['Total_Defaults'] == 1)
A = ((lcd_temp['mos_to_default'] / 12) * lcd_temp['clean_rate']
     + (1 - severity) * lcd_temp['loan_amnt'])
B = (1 + lcd_temp['clean_term'] / 12 * lcd_temp['clean_rate']) * lcd_temp['loan_amnt']
lcd_temp['cum_cf'] = np.where(mask, A, B)
Notice that this performs the calculation on whole columns instead of row-by-row. This improves performance greatly because it gives Pandas/NumPy the opportunity to pass larger arrays of values to fast underlying C/Fortran functions (in this case, to perform the arithmetic). When you work row-by-row, you are performing scalar arithmetic inside a Python loop, which gives NumPy zero chance to shine. If you had to compute row-by-row, you would be just as well (and maybe better) off using plain Python.
Even though `A` and `B` compute values for the entire column -- and some of those values are not used in the final result returned by `np.where` -- this is still faster than computing row-by-row, assuming there are more than a trivial number of rows.
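To get a feel for the vectorized version's speed, here is a sketch that runs it against synthetic data of the same size as the question's (~500k rows). The column names match the question; the values are random placeholders, not real loan data:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for lcd_temp: 500k rows with the question's column names.
n = 500_000
rng = np.random.default_rng(0)
lcd_temp = pd.DataFrame({
    'Total_Defaults': rng.integers(0, 2, n),
    'mos_to_default': rng.integers(1, 61, n),
    'clean_term': rng.integers(12, 61, n),
    'clean_rate': rng.uniform(0.05, 0.25, n),
    'loan_amnt': rng.uniform(1_000, 35_000, n),
})

severity = 1
start = time.perf_counter()
mask = (lcd_temp['Total_Defaults'] == 1)
A = ((lcd_temp['mos_to_default'] / 12) * lcd_temp['clean_rate']
     + (1 - severity) * lcd_temp['loan_amnt'])
B = (1 + lcd_temp['clean_term'] / 12 * lcd_temp['clean_rate']) * lcd_temp['loan_amnt']
lcd_temp['cum_cf'] = np.where(mask, A, B)
print(f"vectorized np.where over {n} rows: {time.perf_counter() - start:.3f}s")
```

On typical hardware this completes in a small fraction of a second, versus the hour-plus the row-by-row loop was taking.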
Upvotes: 5