Error when trying to use PolynomialFeatures in scikit-learn

Question

My code is returning an error when I use PolynomialFeatures:

poly1 = PolynomialFeatures(degree=1)
poly3 = PolynomialFeatures(degree=3)
poly6 = PolynomialFeatures(degree=6)
poly9 = PolynomialFeatures(degree=9)
X_train = X_train.reshape(-1,1)
y_train = y_train.reshape(-1,1)

predictions = []
predict = np.linspace(0,10,100)

x_poly1 = poly1.fit_transform(X_train).reshape(-1,1)
X_train1, X_test1, y_train1, y_test1 = train_test_split(x_poly1, y_train)
linreg1 = LinearRegression().fit(X_train1, y_train1)

x_poly3 = poly3.fit_transform(X_train).reshape(-1,1)
X_train3, X_test3, y_train3, y_test3 = train_test_split(x_poly3, y_train)
linreg3 = LinearRegression().fit(X_train3, y_train3)

x_poly6 = poly6.fit_transform(X_train).reshape(-1,1)
X_train6, X_test6, y_train6, y_test6 = train_test_split(x_poly6, y_train)
linreg6 = LinearRegression().fit(X_train6, y_train6)

x_poly9 = poly9.fit_transform(X_train).reshape(-1,1)
X_train9, X_test9, y_train9, y_test9 = train_test_split(x_poly9, y_train)
linreg9 = LinearRegression().fit(X_train9, y_train9)

predict1 = poly1.fit_transform(predict).reshape(-1,1)
predict3 = poly3.fit_transform(predict).reshape(-1,1)
predict6 = poly6.fit_transform(predict).reshape(-1,1)
predict9 = poly9.fit_transform(predict).reshape(-1,1)

ans1 = linreg1.predict(predict1)
ans3 = linreg3.predict(predict3)
ans6 = linreg6.predict(predict6)
ans9 = linreg9.predict(predict9)

np.concatenate(ans1, ans3, ans6, ans9)

or alternatively

for i in enumerate([1,3,6,9]):
    poly = PolynomialFeatures(degree=i)
    x_poly = poly.fit_transform(X_train).reshape(-1,1)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y_train)
    linreg = LinearRegression().fit(X_train1, y_train1)
    ans = linreg.predict(poly.fit_transform(predict).reshape(-1,1))


np.concatenate(ans1, ans3, ans6, ans9)

In my code, I am trying to append all the values to a list for later use, but I get an error:

    ValueError                                Traceback (most recent call last)
 in ()
     18 
     19 x_poly1 = poly1.fit_transform(X_train).reshape(-1,1)
---> 20 X_train1, X_test1, y_train1, y_test1 = train_test_split(x_poly1, y_train)
     21 linreg1 = LinearRegression().fit(X_train1, y_train1)
     22 

/opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   1687         test_size = 0.25
   1688 
-> 1689     arrays = indexable(*arrays)
   1690 
   1691     if stratify is not None:

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    204         else:
    205             result.append(np.array(X))
--> 206     check_consistent_length(*result)
    207     return result
    208 

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    179     if len(uniques) > 1:
    180         raise ValueError("Found input variables with inconsistent numbers of"
--> 181                          " samples: %r" % [int(l) for l in lengths])
    182 
    183 

ValueError: Found input variables with inconsistent numbers of samples: [22, 11]

What is the reason I am getting this error? I want the end result to be a array with shape (4, 100). Please ask if clarification is needed.

David Waterworth · Accepted Answer

There are a few things that look incorrect but it's hard to tell without a simple example which we can recreate, and the full error message indicating which line is producing it. I can see through that

You're calling poly1.fit_transform 4 times, suspect this is a copy paste error
predictions.append(ans1, ans3, ans6, ans9) should probably be np.concatenate((ans1, ans3, ans6, ans9), axis=1)

But you can see from the stack trace that the error is from line 20, the call to train_test_split. What it's saying are the lengths of x and y aren't consistent. The reason is the reshape you've added to poly1.fit_transform(X_train). In this case, it's taking the output from the transforms which has shape (n,2) - a matrix - and reshaping into (2*n,) - a vector - which is twice as long as the original X_train.

I would recommend learning how to create a Pipeline to combine the PolynomialFeatures and LinearRegression into a single object which you can fit and predict.

i.e. Take a look at Polynomial interpolation

model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

This code should work

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

X_train = np.ones(200)
y_train = np.ones(200)

poly1 = PolynomialFeatures(degree=1)
poly3 = PolynomialFeatures(degree=3)
poly6 = PolynomialFeatures(degree=6)
poly9 = PolynomialFeatures(degree=9)
X_train = X_train.reshape(-1,1)
y_train = y_train.reshape(-1,1)

predictions = []
predict = np.linspace(0,10,100)

x_poly1 = poly1.fit_transform(X_train)  # removed reshape
X_train1, X_test1, y_train1, y_test1 = train_test_split(x_poly1, y_train)
linreg1 = LinearRegression().fit(X_train1, y_train1)

x_poly3 = poly3.fit_transform(X_train)  # removed reshape
X_train3, X_test3, y_train3, y_test3 = train_test_split(x_poly3, y_train)
linreg3 = LinearRegression().fit(X_train3, y_train3)

x_poly6 = poly6.fit_transform(X_train)  # removed reshape
X_train6, X_test6, y_train6, y_test6 = train_test_split(x_poly6, y_train)
linreg6 = LinearRegression().fit(X_train6, y_train6)

x_poly9 = poly9.fit_transform(X_train)  # removed reshape
X_train9, X_test9, y_train9, y_test9 = train_test_split(x_poly9, y_train)
linreg9 = LinearRegression().fit(X_train9, y_train9) # fixed incorrect X,y

predict1 = poly1.fit_transform(predict.reshape(-1,1))
predict3 = poly3.fit_transform(predict.reshape(-1,1)) # changed poly1 to poly3
predict6 = poly6.fit_transform(predict.reshape(-1,1)) # changed poly1 to poly6
predict9 = poly9.fit_transform(predict.reshape(-1,1)) # changed poly1 to poly9

ans1 = linreg1.predict(predict1)
ans3 = linreg3.predict(predict3)
ans6 = linreg6.predict(predict6)
ans9 = linreg9.predict(predict9)

np.concatenate((ans1, ans3, ans6, ans9), axis=1) # use concatenate

And using pipelines it becomes

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
import numpy as np

X = np.ones((20,1))
y = np.ones((20,1))

X_train, X_test, y_train, y_test = train_test_split(X, y)

def create_model(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model

predictions = []
for degree in [1,3,6,9]:
    model = create_model(degree)
    model.fit(X_train,y_train)
    predicted = model.predict(X_test)
    predictions.append(predicted)

predictions = np.concatenate(predictions, axis=1)
print(predictions.shape)

Error when trying to use PolynomialFeatures in scikit-learn

Answers (1)

Related Questions