i'mgnome

Reputation: 545

I get an error when trying to find the best polynomial regression degree for my model

I'm trying to create a function that finds the best degree n for my polynomial regression model using the root mean square error, so I don't have to fit the model over and over again manually, and then plots the outcome. However, I got the error 'x and y must have same first dimension, but have shapes (10, 1) and (1,)' when I tried creating the plot.

Here's my code:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from math import sqrt
import matplotlib.pyplot as plt

# x and y are my training data; x has shape (10, 1)
def poly_fit(n):
  s = []
  for i in range(2, n):
    poly_reg = PolynomialFeatures(degree=i)
    x_poly = poly_reg.fit_transform(x)
    linreg2 = LinearRegression()
    linreg2.fit(x_poly, y)
    s.append(sqrt(mean_squared_error(y, linreg2.predict(x_poly))))
  return min(s)

plt.scatter(x, y, color='red')
plt.plot(x, poly_fit(6), color='blue')
plt.title('Polynomial Regression results')
plt.xlabel('Position/level')
plt.ylabel('Salary')
plt.show()

Upvotes: 0

Views: 359

Answers (1)

Bill Huang

Reputation: 4648

If your x is a one-dimensional column vector, np.polyfit() and np.polyval() will get the job done. np.polyfit(x, y, order, full=True) also returns the residual (the sum of squared residuals, I believe) for you to check the optimal order, so you don't need a second regression fit to get it. Incidentally, the error you saw arises because your poly_fit() returns a single scalar (the minimum RMSE) rather than an array of predictions, so plt.plot(x, poly_fit(6)) is handed mismatched shapes.
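For reference, here is a minimal sketch of what full=True hands back (the toy data and variable names below are purely illustrative, not from the question):

import numpy as np

# toy data: an exact quadratic, so the fit should be near-perfect
x_demo = np.arange(10, dtype=float)
y_demo = 2 * x_demo**2 + 1

# with full=True, np.polyfit also returns least-squares diagnostics:
# residuals (sum of squared residuals), rank, singular values, rcond
coeff, residuals, rank, sing_vals, rcond = np.polyfit(x_demo, y_demo, 2, full=True)
print(coeff)      # close to [2., 0., 1.] (highest power first)
print(residuals)  # essentially [0.] for an exact quadratic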

N.B. As far as I know, your logic of choosing the minimum residual is workable in engineering practice but not mathematically sound. The sum of squared errors (SSE) always decreases as you add regressors, so you would always end up selecting the greatest polynomial order tried. You should instead use a criterion that penalizes additional terms when choosing the model (e.g., AIC or BIC); a sketch of one such criterion follows after the code below. That choice, however, is entirely up to the researcher and is of course beyond the scope of the question itself.

import numpy as np
import matplotlib.pyplot as plt

# get your x and y as np array
x = np.random.uniform(-1,1, 100)
y = x - x**2 + x**3 - x**4 + np.random.normal(0, 0.1, 100)

def poly_fit(n):
  ls_res=[]
  ls_coeff = []
  for i in range(2, n):
    coeff, res, _, _, _ = np.polyfit(x, y, i, full=True)
    ls_res.append(res)
    ls_coeff.append(coeff)

  # argmin should be taken from a penalized loss function
  # add it here
  return ls_coeff[np.argmin(ls_res)]

plt.scatter(x, y, color='red')
coeff = poly_fit(6)
plt.plot(np.sort(x), np.polyval(coeff, np.sort(x)), color='blue')
plt.title('Polynomial Regression results')
plt.xlabel('Position/level')
plt.ylabel('Salary')
plt.show()
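To make that penalty concrete, here is a hedged sketch of what the "add it here" step could look like with BIC, assuming i.i.d. Gaussian errors (constant likelihood terms are dropped since they do not affect the argmin; poly_fit_bic is just an illustrative name, and x and y are the arrays from above):

import numpy as np

def poly_fit_bic(x, y, n):
  # choose the polynomial order by BIC instead of raw SSE;
  # assumes i.i.d. Gaussian errors, constant terms dropped
  n_obs = len(x)
  candidates = []
  for i in range(2, n):
    coeff, res, _, _, _ = np.polyfit(x, y, i, full=True)
    sse = res[0]   # sum of squared residuals (res is empty only for rank-deficient fits)
    k = i + 1      # number of fitted coefficients
    bic = n_obs * np.log(sse / n_obs) + k * np.log(n_obs)
    candidates.append((bic, coeff))
  # the smallest BIC balances goodness of fit against model size
  return min(candidates, key=lambda t: t[0])[1]

coeff = poly_fit_bic(x, y, 6)

On data like the example above, raw SSE would always pick the largest order tried (5 here), whereas BIC will typically settle on the true order of 4.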

[plot: polyfit curve drawn over the scattered data]

Upvotes: 2
