Reputation: 73

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:

    import numpy as np
    from sklearn import linear_model
    clf = linear_model.LinearRegression()
    TST = np.vstack([x1,x2,x3,x4])
    TST = TST.transpose()
    clf.fit (TST,y)
    clf.coef_

Now, I need the standard errors for these same coefficients. How can I do that? Thanks a lot.

Upvotes: 7

Answers (5)

f0xdx

Reputation: 1469

I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):

# preparation
X = np.concatenate((np.ones(TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape

# computation
MSE = np.sum((y_hat - y)**2)/(m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T,X)))
coef_SE_est = np.sqrt(var_est)

Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.

The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use

print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0,:]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i+1, coef, coef_SE_est[i+1]))

Upvotes: 0

Linna Du

Reputation: 21

These code has been tested with data. They are correct.

find the X matrix for each data set, n is the length of dataset, m is the variables number

X, n, m=arrays(data) 
y=***.reshape((n,1))
linear = linear_model.LinearRegression()
linear.fit(X, y , n_jobs=-1) ## delete n_jobs=-1, if it's one variable only.

sum square

s=np.sum((linear.predict(X) - y) ** 2)/(n-(m-1)-1)

standard deviation, square root of the diagonal of variance-co-variance matrix (sigular vector decomposition)

sd_alpha=np.sqrt(s*(np.diag(np.linalg.pinv(np.dot(X.T,X)))))

(t-statistics using, linear.intercept_ for one variable)

t_stat_alpha=linear.intercept_[0]/sd_alpha[0]  #( use linear.intercept_ for one variable_

Upvotes: 0

Luca

Reputation: 29

MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)

I guess that this answer is not entirely correct. In particular, if I am not wrong, according to your code sklearn is adding the constant term in order to compute your coefficient by default.

Then, you need to include in your matrix TST the column of ones. Then, the code is correct and it will give you an array with all the SE

Upvotes: 1

Kyler Brown

Reputation: 1126

Based on this stats question and wikipedia, my best guess is:

MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)

However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.

Upvotes: 5

jonrsharpe

Reputation: 121974

The example from the documentation shows how to get the mean square error and explained variance score:

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

Does this cover what you need?

Upvotes: -1