fred.schwartz

Reputation: 2155

Python OLS statsmodels: t-stats of variables not yet entered into the model

Hi, I have created an OLS regression using statsmodels.

I've written some code that loops through every variable in a dataframe, enters it into the model, records its test statistics (the p-value and coefficient) in a new dataframe, and builds a list of potential variables.

However, I have 20,000 variables, so it takes ages to run each time.

Can anyone think of a better approach?

This is my current approach:

import pandas as pd
import statsmodels.formula.api as smf

TStatsOut = pd.DataFrame()

for i in VarsOut:
    try:
        xstrout = '+'.join([baseterms, i])
        fout = 'ymod~' + xstrout
        modout = smf.ols(fout, data=df_train).fit()
        j = pd.DataFrame(modout.pvalues, index=[i], columns=['PValue'])
        k = pd.DataFrame(modout.params, index=[i], columns=['Coeff'])
        s = pd.concat([j, k], axis=1).reindex(j.index)  # join_axes was removed from pd.concat
        TStatsOut = TStatsOut.append(s)  # note: DataFrame.append is removed in pandas 2.x
    except Exception:
        continue  # skip variables whose fit fails

Upvotes: 0

Views: 951

Answers (1)

jtweeder

Reputation: 759

Here is what I have found regarding your question. My answer uses dask for distributed computing, plus some general cleanup of your current approach.

I made a smaller fake dataset with 1000 variables: one is the outcome and two are the baseterms, so there are really 997 variables to loop through.

import dask
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#make some toy data for the case you showed
df_train = pd.DataFrame(np.random.randint(low=0,high=10,size=(10000, 1000)))
df_train.columns = ['var'+str(x) for x in df_train.columns]
baseterms = 'var1+var2'
VarsOut = df_train.columns[3:]

Baseline for your current code (20 s ± 858 ms):

%%timeit
TStatsOut=pd.DataFrame()

for i in VarsOut:
    xstrout='+'.join([baseterms,i])
    fout='var0~'+xstrout
    modout = smf.ols(fout, data=df_train).fit()
    j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
    k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
    s=pd.concat([j, k], axis=1)
    s=s.reindex(j.index)
    TStatsOut=TStatsOut.append(s)

I created a function for readability; it returns just the p-value and regression coefficient for each variable tested, instead of the one-line dataframes.

def testVar(i):
    xstrout='+'.join([baseterms,i])
    fout='var0~'+xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval=modout.pvalues[i]
    coef=modout.params[i]
    return pval, coef

This now runs at 14.1 s ± 982 ms:

%%timeit
pvals=[]
coefs=[]

for i in VarsOut:
    pval, coef = testVar(i)
    pvals.append(pval)
    coefs.append(coef)

TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
                         index=VarsOut)[['PValue','Coeff']]
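Once TStatsOut exists, turning it into the list of potential variables you mentioned is a one-liner; the 0.05 cutoff below is my own assumption, not something from your question:

#keep variables that clear a significance cutoff, strongest candidates first
candidates = TStatsOut[TStatsOut['PValue'] < 0.05].sort_values('PValue').index.tolist()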

Now, using dask delayed for parallel processing. Keep in mind that each delayed task you create causes a slight overhead as well, so it may not always be beneficial; it will depend on your exact dataset and how long the regressions take. My example data may be too simple to show any benefit.

#define the same function as before, but tell dask how many outputs it has
@dask.delayed(nout=2)
def testVar(i):
    xstrout='+'.join([baseterms,i])
    fout='var0~'+xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval=modout.pvalues[i]
    coef=modout.params[i]
    return pval, coef

Now run through the 997 candidate variables and build the same dataframe with dask delayed (18.6 s ± 588 ms):

%%timeit
pvals=[]
coefs=[]

for i in VarsOut:
    pval, coef = testVar(i)  # testVar is already wrapped by the decorator above
    pvals.append(pval)
    coefs.append(coef)

pvals, coefs = dask.compute(pvals, coefs)
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
                         index=VarsOut)[['PValue','Coeff']]
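Since every delayed call carries a bit of scheduling overhead, one way to get more work out of each task is to batch several variables into a single call. This is only a sketch I have not benchmarked; testVarChunk and the chunk size of 50 are my own choices, not part of the timings above:

@dask.delayed
def testVarChunk(chunk):
    #fit the regression for a whole chunk of variables inside one task
    out = []
    for i in chunk:
        modout = smf.ols('var0~'+'+'.join([baseterms,i]), data=df_train).fit()
        out.append((modout.pvalues[i], modout.params[i]))
    return out

chunks = [VarsOut[i:i+50] for i in range(0, len(VarsOut), 50)]
results = dask.compute(*[testVarChunk(c) for c in chunks])
#flatten the per-chunk lists back into aligned pval/coef sequences
pvals, coefs = zip(*[pair for chunk in results for pair in chunk])
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
                         index=VarsOut)[['PValue','Coeff']]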

Again, dask delayed adds overhead as it creates the tasks to be sent across many processors, so any performance gain will depend on how long your regressions actually take as well as how many CPUs you have available. Dask can be scaled from a single workstation to a cluster of workstations.
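To actually spread the work over multiple processes or machines, create a dask.distributed Client before calling dask.compute. A minimal sketch, assuming the distributed package is installed; the worker counts and scheduler address are placeholders, not values from this answer:

from dask.distributed import Client

#local cluster: one worker process per core on this machine
client = Client(n_workers=4, threads_per_worker=1)

#or point the client at an existing scheduler on a real cluster:
#client = Client('tcp://scheduler-address:8786')

#any dask.compute(...) call made after this runs on the client's workers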

Upvotes: 2
