GPB
GPB

Reputation: 2495

How can I make this loop more efficient?

I have a historical collection of ~ 500k loans, some of which have defaulted, others have not. My dataframe is lcd_temp. lcd_temp has information on the loan size (loan_amnt), if loan has defaulted or not (Total Defaults), annual loan rate (clean_rate),term of loan (clean_term), and months from origination to default (mos_to_default). mos_to_default is equal to clean_term if no default.

I would like to calculate the Cumulative Cashflow [cum_cf] for each loan as the sum of all coupons paid until default plus (1-severity) if loan defaults, and simply the loan_amnt if it pays back on time.

Here's my code, which takes an awful long time to run:

severity = 1

for i in range (0,len(lcd_temp['Total_Defaults'])-1):
    if (lcd_temp.loc[i,'Total_Defaults'] ==1):
    # Default, pay coupon only until time of default, plus (1-severity)
        lcd_temp.loc[i,'cum_cf'] = ((lcd_temp.loc[i,'mos_to_default'] /12)  * lcd_temp.loc[i,'clean_rate'])+(1 severity)*lcd_temp.loc[i,'loan_amnt']
    else: 
    # Total cf is sum of coupons (non compounded) + principal 
        lcd_temp.loc[i,'cum_cf'] = (1+lcd_temp.loc[i,'clean_term']/12* lcd_temp.loc[i,'clean_rate'])*lcd_temp.loc[i,'loan_amnt']

Any thoughts or suggestions on improving the speed (which takes over an hour so far) welcomed!

Upvotes: 2

Views: 111

Answers (1)

unutbu
unutbu

Reputation: 879799

Assuming you are using Pandas/NumPy, the standard way to replace an if-then construction such as the one you are using is to use np.where(mask, A, B). The mask is an array of boolean values. When True, the corresponding value from A is returned. When False, the corresponding value from B is returned. The result is an array of the same shape as mask with values from A and/or B.

severity = 1

mask = (lcd_temp['Total_Defaults'] == 1)
A = (((lcd_temp['mos_to_default'] /12) * lcd_temp['clean_rate'])
     + (1 severity)*lcd_temp['loan_amnt'])
B = (1+lcd_temp['clean_term']/12 * lcd_temp['clean_rate'])*lcd_temp['loan_amnt']

lcd_temp['cum_cf'] = np.where(mask, A, B)

Notice that this performs the calculation on whole columns instead of row-by-row. This improves performance greatly because it gives Pandas/NumPy the opportunity to pass larger arrays of values to fast underlying C/Fortran functions (in this case, to perform the arithmetic). When you work row-by-row, you are performing scalar arithmetic inside a Python loop, which gives NumPy zero chance to shine. If you had to compute row-by-row, you would be just as well (and maybe better) off using plain Python.

Even though A and B computes the values for the entire column -- and some values are not used in the final result returned by np.where -- this is still faster than computing row-by-row assuming there are more than a trivial number of rows.

Upvotes: 5

Related Questions