Reputation: 13
from sklearn.preprocessing import PolynomialFeatures
train_x_p = np.asanyarray(train[['FUELCONSUMPTION_COMB_MPG']])
poly = PolynomialFeatures(degree = 3)
train_x_poly = poly.fit_transform(train_x_p)
regr.fit(train_x_poly, train_y)
print('Coefficients: ', regr.coef_)
print('Intercept', regr.intercept_)
test_x_poly = poly.fit_transform(test_x)
test_y_poly1 = np.asanyarray(test[['CO2EMISSIONS']])  # I'm not sure, especially about this line
test_y_hat_poly1 = regr.predict(test_x_poly)
mse = metrics.mean_squared_error(test_y_poly1, test_y_hat_poly1)
r2 = (r2_score(test_y_poly1,test_y_hat_poly1))
print('MSE&R2SQUARE polynomial linear regression (FUELCONSUMPTION_COMB_MPG): ')
print('MSE: ',mse)
print('r2-sq: ',r2)
What also made me feel it's incorrect is the MSE result. Should I transform the test y to polynomial as well, and if I should, how can I do it?
Upvotes: 1
Views: 499
Reputation: 14462
No, you should not transform your y_true values. What PolynomialFeatures does is take your predictors x_1, x_2, ..., x_p and apply a polynomial transformation of a chosen degree to each one of them. If you have 2 predictors x_1 and x_2 and apply a polynomial transformation of 3rd degree, you end up with a problem of the form:
y = b_0 + b_1 * x_1 + b_2 * x_1^2 + b_3 * x_1^3 + b_4 * x_2 + b_5 * x_2^2 + b_6 * x_2^3
(plus interaction terms such as x_1 * x_2, which PolynomialFeatures also generates by default). You want to do this when there is a non-linear relationship between the predictors and the response and you still want to use a linear model to fit the data. y_true stays the same whether you are using polynomial features or not (and the same holds for most other regression models).
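If you want to see exactly which columns are generated, here is a small sketch (not part of your code, and assuming a scikit-learn version recent enough to have get_feature_names_out) that expands two predictors to degree 3:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])  # one sample with two predictors, x_1 = 2 and x_2 = 3
poly_demo = PolynomialFeatures(degree=3)
X_demo_poly = poly_demo.fit_transform(X_demo)

print(poly_demo.get_feature_names_out(["x1", "x2"]))
# ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2' 'x1^3' 'x1^2 x2' 'x1 x2^2' 'x2^3']
print(X_demo_poly)
# [[ 1.  2.  3.  4.  6.  9.  8. 12. 18. 27.]]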
Your code is almost fine, except for one issue: you are calling fit_transform on the test data, which is something you never want to do. You have already fitted the polynomial features object on the training data; all you need to do is call the transform method to transform your test data:
test_x_poly = poly.transform(test_x)
Here is an example of what it looks like when you use polynomial features and there is a polynomial relationship between the predictor and the response.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# generate data with a quadratic relationship plus some noise
X = np.random.randint(-100, 100, (100, 1))
y = X ** 2 + np.random.normal(size=(100, 1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train)  # fit AND transform the training data
reg = LinearRegression()
reg.fit(X_train_poly, y_train)

# plot the fitted curve over the training data
reg_line_x = poly_features.transform(np.linspace(-100, 100, 1000).reshape((-1, 1)))
reg_line_y = reg.predict(reg_line_x)
plt.scatter(X_train_poly[:, 1].ravel(), y_train)
plt.plot(reg_line_x[:, 1].ravel(), reg_line_y, c="red", label="regression line")
plt.legend()
plt.show()
Then you simply transform your X_test data and make the prediction:
X_test_poly = poly_features.transform(X_test)  # do NOT call fit_transform here
y_pred = reg.predict(X_test_poly)
There is also a more convenient way of doing this: build a pipeline that handles everything (that is, the polynomial transformation and the regression in your case) so that you don't have to perform each individual step manually.
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("poly_features", poly_features),
    ("regression", reg)
])
pipe.fit(X_train, y_train)  # fits the polynomial features and the regression in one call
y_pred = pipe.predict(X_test)  # transforms X_test and predicts, no manual steps needed
print(f"r2 : {r2_score(y_test, y_pred)}")
print(f"mse: {mean_squared_error(y_test, y_pred)}")
r2 : 0.9999997923643911
mse: 1.4848830127345198
Note that the fact that the R squared or MSE shows poor values in your case doesn't mean that your code is wrong. It might be that your data is not suited for the task, or that you need to use a different degree of polynomial transformation; you might be either underfitting or overfitting the training data, etc.
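A rough way to check for under- or overfitting is to compare the training and test scores for several degrees. The sketch below is not part of the original code; it assumes the X_train/X_test/y_train/y_test split from the example above is still available:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

for degree in (1, 2, 3, 5, 10):
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree)),
        ("regression", LinearRegression()),
    ])
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    # a large gap between train and test r2 suggests overfitting;
    # both being low suggests underfitting
    print(f"degree={degree:2d}  train r2={train_r2:.4f}  test r2={test_r2:.4f}")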
Upvotes: 2