Reputation: 1495
I developed a function that sorts a dataframe of stock returns and signals by the signal column, then splits the rows at a set of percentiles to see how much profit/loss occurs on each side of each threshold, along with the summed gain or loss. It runs, but slowly. It contains two `while` loops and several `if` statements, which I suspect are the bottleneck. Is there a way to speed this Python function up?
Here is some sample data to work with:
import numpy as np
import pandas as pd
# x1 plays the role of the returns (y), x2 the role of the signal
y_mean = 1.6966731029796089e-06
y_std = 0.0010495629794829604
x_mean = -7.146476349274362e-06
x_std = 0.00020444862628284671
df_dict = {'x1': np.random.normal(loc=y_mean, scale=y_std, size=100000),
           'x2': np.random.normal(loc=x_mean, scale=x_std, size=100000)}
df = pd.DataFrame(df_dict)
Here is the function itself. Again, it works, but slowly. I call this function inside a permutation test, which means it runs 1,000 times per test; right now a full run takes 1 hour and 10 minutes.
def roc_table(df, row_count, signal, returns):
    """
    Parameters
    ----------
    df : dataframe
    row_count : length of data
    signal : signal/s
    returns : log returns

    Returns
    -------
    table - hopefully
    """
    df = df.copy()
    bins = [.01, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .99]
    df = df.sort_values(signal)
    threshold = []
    frac_greater = []
    frac_less = []
    win_above_list = []
    win_below_list = []
    lose_above_list = []
    lose_below_list = []
    work_signal = np.array(df[signal])
    work_return = np.array(df[returns])
    for bin_ in bins:
        k = np.round((bin_ * (row_count + 1)) - 1)
        k = int(k)
        threshold.append(work_signal[k])
        # print(threshold)
        # print(k)
        if k < 0:
            k = 0
        win_above = 1e-60
        win_below = 1e-60
        lose_above = 1e-60
        lose_below = 1e-60
        i = 0
        while i < k:
            if work_return[i] > 0:
                lose_below += work_return[i]
            else:
                win_below -= work_return[i]
            i += 1
        r = i
        while r < row_count:
            if work_return[r] > 0:
                win_above += work_return[r]
            else:
                lose_above -= work_return[r]
            r += 1
        frac_greater.append(np.round((row_count - k) / row_count, 2))
        if lose_above > 0:
            lose_above_list.append(np.round(win_above / lose_above, 2))
        else:
            lose_above_list.append("inf")
        if win_above > 0:
            win_above_list.append(np.round(lose_above / win_above, 2))
        else:
            win_above_list.append("inf")
        frac_less.append(np.round(k / row_count, 2))
        if lose_below > 0:
            lose_below_list.append(np.round(win_below / lose_below, 2))
        else:
            lose_below_list.append("inf")
        if win_below > 0:
            win_below_list.append(np.round(lose_below / win_below, 2))
        else:
            win_below_list.append("inf")
    roc_dict = {"threshold": threshold,
                "frac Gtr/Eq": frac_greater,
                "Long PF": lose_above_list,
                "Short PF": win_above_list,
                "Frac Less": frac_less,
                "Short PF Less": lose_below_list,
                "Long PF Less": win_below_list}
    roc = pd.DataFrame(roc_dict)
    return roc
And then to run it just do:
df1 = roc_table(df, df.shape[0], 'x2', 'x1')
df1
I'm not sure what can be done, but thank you in advance for taking a look.
Upvotes: 0
Views: 62
Reputation: 50358
You can use Numba to easily speed up the loops in the code. Here is an example:
import numba as nb

@nb.njit(nb.types.UniTuple(nb.float64, 4)(nb.float64[::1], nb.int64, nb.int64))
def loops(work_return, row_count, k):
    win_above = 1e-60
    win_below = 1e-60
    lose_above = 1e-60
    lose_below = 1e-60
    i = 0
    while i < k:
        if work_return[i] > 0:
            lose_below += work_return[i]
        else:
            win_below -= work_return[i]
        i += 1
    r = i
    while r < row_count:
        if work_return[r] > 0:
            win_above += work_return[r]
        else:
            lose_above -= work_return[r]
        r += 1
    return win_above, win_below, lose_above, lose_below
def roc_table(df, row_count, signal, returns):
    # [...] the beginning is left unchanged
    for bin_ in bins:
        k = np.round((bin_ * (row_count + 1)) - 1)
        k = int(k)
        threshold.append(work_signal[k])
        if k < 0:
            k = 0
        win_above, win_below, lose_above, lose_below = loops(work_return, row_count, k)
        # [...] the rest is unchanged
This makes the code 43 times faster on my machine. You can probably speed it up further by moving the whole outer loop into Numba, but that is harder and may not be worth the effort.
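As an alternative to Numba, both `while` loops can be eliminated entirely with NumPy prefix sums: for each bin, the four accumulators are just cumulative sums of the positive and (negated) non-positive returns split at index `k`. Below is a sketch (the name `roc_table_vec` is mine) that reproduces the original's rounding and `1e-60` initialization; note that because the accumulators start at `1e-60` they can never be non-positive, so the `"inf"` branches of the original are dead code and are omitted here:

```python
import numpy as np
import pandas as pd

def roc_table_vec(df, row_count, signal, returns):
    # Vectorized sketch: both while loops become prefix sums (np.cumsum).
    df = df.sort_values(signal)
    bins = np.array([.01, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .99])
    work_signal = df[signal].to_numpy()
    work_return = df[returns].to_numpy()

    # Same index formula as the original, computed for all bins at once.
    k = np.round(bins * (row_count + 1) - 1).astype(int)
    threshold = work_signal[k]  # the original also indexes before clamping k
    k = np.maximum(k, 0)

    # Split returns into positive and (negated) non-positive parts.
    pos = np.where(work_return > 0, work_return, 0.0)
    neg = np.where(work_return > 0, 0.0, -work_return)
    # Prefix sums with a leading zero so cum_pos[k] == pos[:k].sum().
    cum_pos = np.concatenate(([0.0], np.cumsum(pos)))
    cum_neg = np.concatenate(([0.0], np.cumsum(neg)))

    eps = 1e-60  # matches the original accumulator initialization
    lose_below = cum_pos[k] + eps
    win_below = cum_neg[k] + eps
    win_above = cum_pos[-1] - cum_pos[k] + eps
    lose_above = cum_neg[-1] - cum_neg[k] + eps

    return pd.DataFrame({
        "threshold": threshold,
        "frac Gtr/Eq": np.round((row_count - k) / row_count, 2),
        "Long PF": np.round(win_above / lose_above, 2),
        "Short PF": np.round(lose_above / win_above, 2),
        "Frac Less": np.round(k / row_count, 2),
        "Short PF Less": np.round(win_below / lose_below, 2),
        "Long PF Less": np.round(lose_below / win_below, 2),
    })
```

This removes the Python-level loops (and the Numba dependency) and should produce the same table as the original up to floating-point summation order.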
Upvotes: 2