Reputation: 2155
Hi, I have created an OLS regression using Statsmodels.
I've written some code that loops through every variable in a dataframe, enters it into the model, and then records the t-stat in a new dataframe, building a list of potential variables.
However, I have 20,000 variables, so it takes ages to run each time.
Can anyone think of a better approach?
This is my current approach:
TStatsOut = pd.DataFrame()
for i in VarsOut:
    try:
        xstrout = '+'.join([baseterms, i])
        fout = 'ymod~' + xstrout
        modout = smf.ols(fout, data=df_train).fit()
        j = pd.DataFrame(modout.pvalues, index=[i], columns=['PValue'])
        k = pd.DataFrame(modout.params, index=[i], columns=['Coeff'])
        s = pd.concat([j, k], axis=1, join_axes=[j.index])
        TStatsOut = TStatsOut.append(s)
    except Exception:
        pass  # skip variables whose model fails to fit
Upvotes: 0
Views: 951
Reputation: 759
Here is what I have found in regard to your question. My answer uses dask for distributed computing, along with some general cleanup of your current approach.
I made a smaller fake dataset with 1000 variables; one is the outcome and two are the baseterms, so there are really 997 variables to loop through.
import dask
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# make some toy data for the case you showed
df_train = pd.DataFrame(np.random.randint(low=0, high=10, size=(10000, 1000)))
df_train.columns = ['var' + str(x) for x in df_train.columns]
baseterms = 'var1+var2'
VarsOut = df_train.columns[3:]
Baseline for your current code (20 s ± 858 ms):
%%timeit
TStatsOut = pd.DataFrame()
for i in VarsOut:
    xstrout = '+'.join([baseterms, i])
    fout = 'var0~' + xstrout
    modout = smf.ols(fout, data=df_train).fit()
    j = pd.DataFrame(modout.pvalues, index=[i], columns=['PValue'])
    k = pd.DataFrame(modout.params, index=[i], columns=['Coeff'])
    s = pd.concat([j, k], axis=1)
    s = s.reindex(j.index)
    TStatsOut = TStatsOut.append(s)
I created a function for readability; it returns just the p-value and regression coefficient for each variable tested, instead of the one-line dataframes.
def testVar(i):
    xstrout = '+'.join([baseterms, i])
    fout = 'var0~' + xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval = modout.pvalues[i]
    coef = modout.params[i]
    return pval, coef
This now runs at 14.1 s ± 982 ms:
%%timeit
pvals = []
coefs = []
for i in VarsOut:
    pval, coef = testVar(i)
    pvals.append(pval)
    coefs.append(coef)
TStatsOut = pd.DataFrame(data={'PValue': pvals, 'Coeff': coefs},
                         index=VarsOut)[['PValue', 'Coeff']]
Using dask delayed for parallel processing. Keep in mind that each delayed task created causes a slight overhead as well, so sometimes it may not be beneficial; it will depend on your exact dataset and how long the regressions are taking. My data example may be too simple to show any benefit.
# define the same function as before, but tell dask how many outputs it has
@dask.delayed(nout=2)
def testVar(i):
    xstrout = '+'.join([baseterms, i])
    fout = 'var0~' + xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval = modout.pvalues[i]
    coef = modout.params[i]
    return pval, coef
Now run through the 997 candidate variables and create the same dataframe with dask delayed (18.6 s ± 588 ms):
%%timeit
pvals = []
coefs = []
for i in VarsOut:
    # the decorator already made testVar lazy, so calling it returns
    # two Delayed objects instead of running the regression immediately
    pval, coef = testVar(i)
    pvals.append(pval)
    coefs.append(coef)
pvals, coefs = dask.compute(pvals, coefs)
TStatsOut = pd.DataFrame(data={'PValue': pvals, 'Coeff': coefs},
                         index=VarsOut)[['PValue', 'Coeff']]
Again, dask delayed creates more overhead, as it creates tasks to be sent across many processors, so any performance gain will depend on how long your data actually takes in the regressions, as well as how many CPUs you have available. Dask can be scaled from a single workstation to a cluster of workstations.
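As a rough sketch of that scaling step (this assumes the dask.distributed package is installed, and the scheduler address below is just a placeholder, not something from your setup), you can point the same delayed code at a local process pool or a remote cluster just by creating a Client before calling dask.compute:

from dask.distributed import Client

# start a local cluster; by default this launches one worker process per core,
# and subsequent dask.compute calls in the session will use it automatically
client = Client()

# or connect the same code to an existing remote scheduler
# (placeholder address for illustration only):
# client = Client('tcp://scheduler-address:8786')

pvals, coefs = dask.compute(pvals, coefs)  # now runs on the cluster workers

Because the statsmodels fits are CPU-bound, running them in worker processes like this can sidestep the GIL in a way the default threaded scheduler cannot.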
Upvotes: 2