Jordan

Reputation: 1495

How to optimize my Python function that uses loops

I developed a function that sorts a dataframe of stock returns and signals by the signal, then splits the rows at a set of percentiles to see how much profit or loss occurs above and below the signal threshold for each bin, along with the summed return or loss. It runs, but slowly. It contains two 'while' loops and a few 'if' statements, and I'm fairly sure these are what slow it down. Is there a way to speed this Python function up?

Here is some sample data to work with:

import numpy as np
import pandas as pd
#make y
y_mean = 1.6966731029796089e-06
y_std =  0.0010495629794829604

x_mean = -7.146476349274362e-06
x_std = 0.00020444862628284671

df_dict = {'x1':np.random.normal(loc=y_mean, scale = y_std, size = 100000), 'x2':np.random.normal(loc=x_mean, scale = x_std, size = 100000)}

df = pd.DataFrame(df_dict)

Here is the function itself. Again, it works, but slowly. I use it in a permutation test, which means it runs 1000 times per test. Right now, one test takes 1 hour and 10 minutes to complete.

def roc_table(df, row_count, signal, returns):
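    """
    Build a table of profit factors above and below signal percentile thresholds.

    Parameters
    ----------
    df : dataframe
    row_count : length of data
    signal : signal/s
    returns : log returns

    Returns
    -------
    roc : dataframe with one row per percentile bin

    """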
    df = df.copy()
    
    bins = [.01, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .99]
    
    df = df.sort_values(signal)
    threshold = []
    frac_greater = []
    frac_less = []
    win_above_list = []
    win_below_list = []
    lose_above_list = []
    lose_below_list = []
    
    work_signal = np.array(df[signal])
    work_return = np.array(df[returns])
    
    
    for bin_ in bins:
        k = np.round((bin_*(row_count+1))-1)
        k = int(k)
        # clamp before indexing so a negative k cannot wrap around to the last row
        if k < 0:
            k = 0
        threshold.append(work_signal[k])
        win_above = 1e-60
        win_below = 1e-60
        lose_above = 1e-60
        lose_below = 1e-60
    

        i=0
        while i < k:
            if work_return[i] > 0:
                lose_below += work_return[i]
            else:
                win_below -= work_return[i]

            i += 1
        
        
            
        r = i
        while r < row_count:
            if work_return[r] > 0:
                win_above += work_return[r]
            else: 
                lose_above -= work_return[r]
            r+=1
        

        frac_greater.append((np.round(((row_count-k)/row_count),2)))
        if lose_above > 0:
            lose_above_list.append(np.round(win_above/lose_above,2))
        else:
            lose_above_list.append("inf")
            
        if win_above > 0:
            win_above_list.append(np.round((lose_above/win_above),2))
        else:
            win_above_list.append("inf")
            
        frac_less.append(np.round((k/row_count),2))
        
        if lose_below > 0:
            lose_below_list.append(np.round((win_below/lose_below),2))
        else:
            lose_below_list.append("inf")
            
        if win_below > 0:
            win_below_list.append(np.round((lose_below/win_below),2))
        else:
            win_below_list.append("inf")
            
    roc_dict = {"threshold":threshold,
                "frac Gtr/Eq":frac_greater,
                "Long PF":lose_above_list,
                "Short PF":win_above_list,
                "Frac Less":frac_less,
                "Short PF Less":lose_below_list,
                "Long PF Less":win_below_list}
    

    
    roc = pd.DataFrame(roc_dict)
        
    return roc        

And then to run it just do:

df1 = roc_table(df, df.shape[0], 'x2', 'x1')
df1

I'm not sure what can be done but thank you in advance for taking a look.

Upvotes: 0

Views: 62

Answers (1)

Jérôme Richard

Reputation: 50358

You can use Numba to easily speed up the loops in the code. Here is an example:

import numba as nb

@nb.njit(nb.types.UniTuple(nb.float64,4)(nb.float64[::1], nb.int64, nb.int64))
def loops(work_return, row_count, k):
    win_above = 1e-60
    win_below = 1e-60
    lose_above = 1e-60
    lose_below = 1e-60

    i=0
    while i < k:
        if work_return[i] > 0:
            lose_below += work_return[i]
        else:
            win_below -= work_return[i]
        i += 1

    r = i
    while r < row_count:
        if work_return[r] > 0:
            win_above += work_return[r]
        else: 
            lose_above -= work_return[r]
        r+=1

    return win_above, win_below, lose_above, lose_below

def roc_table(df, row_count, signal, returns):
    # [...] the beginning is left unchanged

    for bin_ in bins:
        k = np.round((bin_*(row_count+1))-1)
        k = int(k)
        # clamp before indexing so a negative k cannot wrap around to the last row
        if k < 0:
            k = 0
        threshold.append(work_signal[k])

        win_above, win_below, lose_above, lose_below = loops(work_return, row_count, k)

        # [...] the remaining is left unchanged

This makes the code 43 times faster on my machine. You can probably speed it up even more by compiling the whole outer loop with Numba, but this is harder and may not be worth it.
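Another option, offered only as a sketch and not something measured here: the two inner loops just sum the positive and the negative returns before and after index k, so prefix sums computed once with np.cumsum can serve every bin without any Python-level loop over rows. The function name split_sums and its signature are hypothetical, and it assumes work_return is a 1-D float array and every k lies in [0, row_count]. The "above" sums come from a subtraction, so they can differ from the loop version in the last floating-point digits.

```python
import numpy as np

def split_sums(work_return, ks):
    """Hypothetical helper: per-split sums matching the two while loops.

    Returns (win_above, win_below, lose_above, lose_below), each an array
    with one entry per split index in ks.
    """
    r = np.asarray(work_return, dtype=np.float64)
    ks = np.asarray(ks, dtype=np.int64)

    # Separate gains and losses; losses are stored as positive magnitudes,
    # mirroring the "-= work_return[i]" branches of the original loops.
    pos = np.where(r > 0, r, 0.0)
    neg = np.where(r > 0, 0.0, -r)

    # Leading zero so that cum[k] is the sum of the first k elements.
    cum_pos = np.concatenate(([0.0], np.cumsum(pos)))
    cum_neg = np.concatenate(([0.0], np.cumsum(neg)))

    eps = 1e-60  # same starting value as the original accumulators
    lose_below = eps + cum_pos[ks]
    win_below = eps + cum_neg[ks]
    win_above = eps + (cum_pos[-1] - cum_pos[ks])
    lose_above = eps + (cum_neg[-1] - cum_neg[ks])
    return win_above, win_below, lose_above, lose_below
```

Inside roc_table, the k values for all bins could be collected first (after clamping each to 0) and passed in a single call, with the rounding and "inf" handling applied to the returned arrays afterwards.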

Upvotes: 2
