PandaBearSoup

Reputation: 699

Multivariate regression not getting same coefficients as sklearn

I am computing coefficients like this:

import numpy as np

def estimate_multivariate(data, target):
    # ordinary least squares via the normal equations: beta = (x'x)^-1 x'y
    x = np.array(data)
    y = np.array(target)
    inv = np.linalg.inv(np.dot(x.T, x))
    beta = np.dot(np.dot(inv, x.T), y)
    return beta

and get these results:

[[ 103.56793536]
 [  63.93186848]
 [-272.06215991]
 [ 500.43324361]
 [ 327.45075839]]

However, if I create the model with sklearn.linear_model I get these results:

[ 118.45775015   64.56441108 -256.20123986  500.43324362  327.45075841]

This only happens when I use

from sklearn import preprocessing

poly = preprocessing.PolynomialFeatures(degree=2)
x = poly.fit_transform(x)

with a degree greater than 1. When I use the original data the coefficients of both methods are the same. What could account for this? Is there some truncation somewhere?

Upvotes: 0

Views: 263

Answers (1)

ogrisel

Reputation: 40159

Just to check: which model from sklearn.linear_model did you use? LinearRegression? All the other regression models in that module are penalized, which could explain the discrepancy.

Assuming this is using LinearRegression, you should do one of the following (see the sketch after this list):

  • make sure that you have a column in your data array with constant value 1 and treat the beta of that column as the intercept_ of the linear model,

  • or disable intercept fitting for the linear model: LinearRegression(fit_intercept=False).fit(data, target).coef_
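
For example, a minimal sketch of the comparison (the toy data and variable names here are only illustrative, not from your code):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# toy data, only to illustrate the comparison
rng = np.random.RandomState(0)
x = rng.rand(50, 2)
y = 3 * x[:, 0] - 2 * x[:, 1] ** 2 + 1

# degree=2 already prepends a constant column of ones, so the first
# coefficient plays the role of intercept_
x_poly = PolynomialFeatures(degree=2).fit_transform(x)

# closed-form estimate, as in the question
beta = np.linalg.inv(np.dot(x_poly.T, x_poly)).dot(x_poly.T).dot(y)

# disable intercept fitting so coef_ lines up with beta entry by entry
coef = LinearRegression(fit_intercept=False).fit(x_poly, y).coef_

print(np.allclose(beta, coef))  # True on this well-conditioned toy data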

Assuming you have also taken care of that, keep in mind that extracting polynomial features significantly increases the number of features, and if your number of samples is too small, the empirical covariance matrix will be ill-conditioned and calling np.linalg.inv on it will be numerically unstable. For reference, LinearRegression uses a least squares solver instead of the closed-form formula involving np.linalg.inv.
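
If you want to keep the manual computation, one safer variant is to avoid the explicit inverse and call a least squares routine such as np.linalg.lstsq (just a sketch; the function name is made up here):

import numpy as np

def estimate_multivariate_lstsq(data, target):
    # solve min ||x beta - y||_2 directly; this is much better behaved
    # than forming (x'x)^-1 when x'x is ill-conditioned
    x = np.asarray(data, dtype=float)
    y = np.asarray(target, dtype=float)
    beta, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)
    return beta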

When n_features >> n_samples you should use a penalized linear regression model such as sklearn.linear_model.Ridge instead of ordinary least squares.
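
A minimal Ridge sketch, reusing x_poly and y from the example above (alpha=1.0 is only a placeholder and should be tuned, e.g. with cross-validation):

from sklearn.linear_model import Ridge

# l2-penalized least squares; the penalty keeps the problem well posed
# even when the polynomial features outnumber the samples
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(x_poly, y)
print(ridge.coef_)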

Upvotes: 2
