Reputation: 699
I am computing coefficients like this:
import numpy as np

def estimate_multivariate(data, target):
    # Ordinary least squares via the normal equations:
    # beta = (X^T X)^{-1} X^T y
    x = np.array(data)
    y = np.array(target)
    inv = np.linalg.inv(np.dot(x.T, x))
    beta = np.dot(np.dot(inv, x.T), y)
    return beta
and get these results:
[[ 103.56793536]
 [  63.93186848]
 [-272.06215991]
 [ 500.43324361]
 [ 327.45075839]]
However if I create the model with sklearn.linear_model I get these results:
[ 118.45775015 64.56441108 -256.20123986 500.43324362 327.45075841]
This only happens when I transform the data with

poly = preprocessing.PolynomialFeatures(degree=2)
x = poly.fit_transform(x)

using a degree greater than 1. When I use the original data, the coefficients from both methods are identical. What could account for this? Is there some truncation somewhere?
Upvotes: 0
Views: 263
Reputation: 40159
Just to check: which model from sklearn.linear_model did you use? LinearRegression? All the other regression models in that module are penalized, which could explain the discrepancy.
Assuming you are using LinearRegression, you should either:

- make sure your data array has a column with the constant value 1 and treat the beta of that column as the intercept_ of the linear model, or
- disable intercept fitting for the linear model: LinearRegression(fit_intercept=False).fit(data, target).coef_ (see the sketch after this list).
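The sketch below uses made-up data (shapes and seed are arbitrary, for illustration only) and checks that the normal-equation estimate matches LinearRegression(fit_intercept=False) on the polynomially expanded features; PolynomialFeatures already adds the constant column:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data, for illustration only.
rng = np.random.RandomState(0)
data = rng.rand(50, 4)
target = rng.rand(50)

# PolynomialFeatures adds a bias column of ones by default, so the
# estimator's own intercept is disabled to avoid fitting it twice.
x = PolynomialFeatures(degree=2).fit_transform(data)

manual_beta = np.linalg.inv(np.dot(x.T, x)).dot(x.T).dot(target)
sklearn_beta = LinearRegression(fit_intercept=False).fit(x, target).coef_

# Both estimates should agree up to numerical precision here.
print(np.allclose(manual_beta, sklearn_beta))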
Assuming you also took care of that, keep in mind that extracting polynomial features significantly increases the number of features, and if your number of samples is too small, the empirical covariance matrix x.T.dot(x) will be ill conditioned, so calling np.linalg.inv on it will be numerically very unstable. For reference, LinearRegression uses a dedicated least squares solver (scipy.linalg.lstsq) instead of the closed-form formula involving np.linalg.inv.
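As a side note, here is a sketch of a more stable way to compute the same coefficients without forming the inverse explicitly (the function name is my own, chosen to mirror the one in the question):

import numpy as np

def estimate_multivariate_lstsq(data, target):
    # Solve the least squares problem directly instead of forming and
    # inverting x.T @ x; this is much more stable when the design
    # matrix is ill conditioned.
    x = np.array(data)
    y = np.array(target)
    beta, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)
    return beta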
When n_features >> n_samples, you should use a penalized linear regression model such as sklearn.linear_model.Ridge instead of ordinary least squares.
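For instance, a minimal sketch (the data shapes and the alpha value are arbitrary, for illustration only):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Made-up setting where the polynomial expansion yields more
# features (28) than samples (10).
rng = np.random.RandomState(0)
data = rng.rand(10, 6)
target = rng.rand(10)
x = PolynomialFeatures(degree=2).fit_transform(data)

# alpha sets the strength of the L2 penalty; 1.0 is arbitrary and
# should be tuned, e.g. with RidgeCV or cross-validation.
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(x, target)
print(ridge.coef_)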
Upvotes: 2