Antoni Parellada

Reputation: 4791

How to select columns of a DataFrame to call a linear regression (OLS and lasso) in sklearn

I am not comfortable with Python; I find R much less intimidating and am more at ease with it. So indulge me with a silly question that has cost me a ton of searches without success.

I want to fit a regression model with sklearn, using both OLS and lasso. In particular, I like the mtcars dataset, which is so easy to call in R and, as it turns out, also very accessible in Python:

import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf


mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)

It looks like this:

                      mpg  cyl   disp   hp  drat  ...   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  ...  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  ...  17.02   0   1     4     4
Datsun 710           22.8    4  108.0   93  3.85  ...  18.61   1   1     4     1
Hornet 4 Drive       21.4    6  258.0  110  3.08  ...  19.44   1   0     3     1

In trying to use LinearRegression() the usual structure found is

import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(x, y)

but to do so, I need to select several columns of df as the regressors x, and one column as the dependent variable y. For example, I'd like an x matrix that includes a column of 1's (for the intercept), disp and qsec (numerical variables), and cyl (categorical variable). As the dependent variable, I'd like to use mpg.

If it were possible to write it this way, it would look like

model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)

But how do I go about the syntax for it?

Similarly, how can I do the same with lasso:

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)

but again this is not the right syntax.


I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the column names are then gone, and it is hard to tell which variable corresponds to each coefficient. And I still haven't found a simple way to get diagnostic values such as p-values, or even the R-squared to begin with.

Upvotes: 1

Views: 1689

Answers (2)

StupidWolf

Reputation: 46908

You can try patsy, which statsmodels uses to build design matrices from formulas:

import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix

mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data

mat = dmatrix("disp + qsec + C(cyl)", mtcars)

It looks like this. We can drop the first column (the intercept), since sklearn's LinearRegression fits an intercept by default:

mat
 
DesignMatrix with shape (32, 5)
  Intercept  C(cyl)[T.6]  C(cyl)[T.8]   disp   qsec
          1            1            0  160.0  16.46
          1            1            0  160.0  17.02
          1            0            0  108.0  18.61
          1            1            0  258.0  19.44
          1            0            1  360.0  17.02

X = pd.DataFrame(mat[:,1:],columns = mat.design_info.column_names[1:])

from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,mtcars['mpg'])

But the coefficients in model.coef_ are not labeled. You can put them into a Series indexed by the column names to read them:

pd.Series(model.coef_,index = X.columns)
 
C(cyl)[T.6]   -5.087564
C(cyl)[T.8]   -5.535554
disp          -0.025860
qsec          -0.162425
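The same design matrix should work for the lasso part of the question. Here is a minimal sketch, assuming patsy's return_type="dataframe" option to keep the labels; the `cars` frame is just ten rows copied from mtcars so the example runs without a download, and `lasso_coefs` and `max_iter=100_000` are my own choices, not something from the question.

```python
import pandas as pd
from patsy import dmatrix
from sklearn.linear_model import Lasso

# Ten rows copied from mtcars so the sketch runs offline (assumption: use the full frame in practice).
cars = pd.DataFrame({
    "mpg":  [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2],
    "cyl":  [6, 6, 4, 6, 8, 6, 8, 4, 4, 6],
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30],
})

# return_type="dataframe" keeps the column labels; drop patsy's intercept column
# because Lasso fits its own intercept (fit_intercept=True by default).
X = dmatrix("disp + qsec + C(cyl)", cars, return_type="dataframe").drop(columns="Intercept")

# alpha=0.001 as in the question; a larger max_iter gives the solver room on unscaled data.
lasso = Lasso(alpha=0.001, max_iter=100_000).fit(X, cars["mpg"])

# Pair each coefficient with its column label so the output is readable.
lasso_coefs = pd.Series(lasso.coef_, index=X.columns)
print(lasso_coefs)
print("intercept:", lasso.intercept_)
```

Since the lasso penalty depends on the scale of each column, you may want to standardize the regressors first; I left that out to keep the sketch short.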

As for p-values, sklearn's linear regression has no ready-made method for them. You can check out these answers; maybe one of them is what you are looking for.
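If the diagnostics matter more than staying inside sklearn, one option is to fit the OLS model with the statsmodels formula API, which accepts an R-style formula directly. This is only a sketch of that alternative, not a way to get p-values out of sklearn itself; the `cars` frame is ten rows copied from mtcars so it runs offline, and `ols_fit` is a name I made up.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Ten rows copied from mtcars so the sketch runs offline (assumption: use the full frame in practice).
cars = pd.DataFrame({
    "mpg":  [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2],
    "cyl":  [6, 6, 4, 6, 8, 6, 8, 4, 4, 6],
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30],
})

# R-style formula; C(cyl) treats cyl as categorical and the intercept is added automatically.
ols_fit = smf.ols("mpg ~ disp + qsec + C(cyl)", data=cars).fit()

print(ols_fit.params)    # labeled coefficients
print(ols_fit.pvalues)   # labeled p-values
print(ols_fit.rsquared)  # R-squared
```

For the R-squared alone you can stay in sklearn: the score method of a fitted LinearRegression returns it, e.g. model.score(X, mtcars['mpg']).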

Upvotes: 1

Antoni Parellada

Reputation: 4791

Here are two ways - unsatisfactory, especially because the variable labels seem to be gone once the regression gets going:

import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf


mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)

import numpy as np
from sklearn.linear_model import LinearRegression

Single-variable regression mpg (d.v.) ~ hp (i.v.):

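lm = LinearRegression()

# Convert to a plain NumPy array (np.matrix is deprecated and recent scikit-learn versions reject it)
mat = df.to_numpy()

# Column 3 is hp (regressor, kept 2-D), column 0 is mpg (response)
lmFit = lm.fit(mat[:, [3]], mat[:, 0])

print(lmFit.coef_)
print(lmFit.intercept_)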

For multiple regression drat ~ wt + cyl + carb:

lmm = LinearRegression()
# Stack the regressors column-wise into a plain 2-D array, in the order cyl, wt, carb
stack = np.column_stack((df['cyl'], df['wt'], df['carb']))

# drat is the response
lmFit2 = lmm.fit(stack, df['drat'])
print(lmFit2.coef_)
print(lmFit2.intercept_)
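The labels do not have to be lost: selecting the columns by name from the dataframe and passing that selection to fit lets you attach the names back to the coefficients afterwards. A minimal sketch of the multiple regression above, assuming ten rows copied from mtcars (the `cars` frame) so it runs without a download; `features` and `coefs` are names I made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Ten rows copied from mtcars so the sketch runs offline (assumption: use the full frame in practice).
cars = pd.DataFrame({
    "drat": [3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92],
    "cyl":  [6, 6, 4, 6, 8, 6, 8, 4, 4, 6],
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440],
    "carb": [4, 4, 1, 1, 2, 1, 4, 2, 2, 4],
})

# Select the regressors by name; a list of names returns a 2-D DataFrame.
features = ["cyl", "wt", "carb"]
X = cars[features]
y = cars["drat"]

model = LinearRegression().fit(X, y)

# coef_ follows the column order of X, so the names line up with the coefficients.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
print("intercept:", model.intercept_)
print("R-squared:", model.score(X, y))
```

Note that cyl enters as a plain number here, as in the code above, not as a categorical variable.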

Upvotes: 0
