Ben

Reputation: 401

Python Regression Variable Selection

I have a basic linear regression with 80 numerical variables (no classification variables). Training set has 1600 rows, testing 700.

I would like a Python package that iterates through all column combinations to find the best one according to a custom score function or an out-of-the-box score function like AIC. Or, if that doesn't exist, what do people here use for variable selection? I know R has some packages like this, but I don't want to deal with Rpy2.

I have no preference if the LM requires scikit learn, numpy, pandas, statsmodels, or other.

Upvotes: 4

Views: 2188

Answers (1)

Takahiro Yoshizawa

Reputation: 111

I can suggest using the Least Absolute Shrinkage and Selection Operator (Lasso). I haven't used it in a situation like yours, where you have to deal with so much data.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

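I usually write code like this to do linear regression with statsmodels:

import statsmodels.api as sm

# OLS takes the response first, then the predictors; the data goes to the
# constructor, not to fit(). statsmodels does not add an intercept by itself.
model = sm.OLS(train_Y, sm.add_constant(train_X))
results = model.fit()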

If I want to do Lasso regression, I write code like this:

from sklearn import linear_model

# alpha=1.0 is the default regularization strength
model = linear_model.Lasso(alpha=1.0)
results = model.fit(train_X, train_Y)

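You have to choose an appropriate alpha. It must be non-negative and is not limited to 1.0: a larger alpha shrinks more coefficients to exactly zero (those variables are effectively dropped), while an alpha close to 0.0 behaves almost like ordinary least squares. Which value is appropriate depends on how much error you are willing to accept.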
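
One common way to pick alpha is cross-validation, for example with scikit-learn's LassoCV. The following is a minimal sketch, assuming synthetic data in place of train_X and train_Y; the columns are standardized first because the Lasso penalty depends on the scale of each variable.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption: not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only columns 0 and 1 actually drive the response.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Put all columns on the same scale before applying the L1 penalty.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV tries a grid of alpha values and keeps the one with the best
# 5-fold cross-validated error.
model = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Columns with a non-zero coefficient are the ones the Lasso kept.
selected = np.flatnonzero(model.coef_)
print("chosen alpha:", model.alpha_)
print("selected columns:", selected)
```

The columns in selected could then be used to refit a plain OLS model if unshrunk coefficients are wanted.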

Try this.

Upvotes: 4
