rrryok

Reputation: 65

Sklearn pipeline transform specific columns - ValueError: too many values to unpack (expected 2)

I am trying to build a pipeline with a scaler, a OneHotEncoder, PolynomialFeatures, and finally a linear regression model:

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
                    ('scaler', StandardScaler(), num_cols),
                    ('polynom', PolynomialFeatures(3), num_cols), 
                    ('encoder', OneHotEncoder(), cat_cols),
                   ('linear_regression', LinearRegression() )
])

But when I fit the pipeline, I get ValueError: too many values to unpack (expected 2):

pipeline.fit(x_train,y_train)
pipeline.score(x_test, y_test)

Upvotes: 2

Views: 1435

Answers (1)

user2246849

Reputation: 4407

If I understand correctly, you want to apply some steps of the pipeline to specific columns. Adding the column names at the end of a pipeline step is incorrect and is what causes the error: each Pipeline step must be a (name, estimator) pair. To target specific columns, use a ColumnTransformer instead. Here you can find another similar example.
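To see where the error comes from, here is a minimal sketch that reproduces it. The toy DataFrame and its column names are made up for illustration; the point is only that a third element in a step tuple is not treated as a column selection, so the step cannot be unpacked.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for this illustration.
X = pd.DataFrame({'n1': [0.0, 1.0, 2.0, 3.0], 'n2': [1.0, 3.0, 2.0, 5.0]})
y = [0.0, 1.0, 2.0, 3.0]

error_message = None
try:
    # The first step is a 3-tuple; Pipeline only understands (name, estimator).
    broken = Pipeline([
        ('scaler', StandardScaler(), ['n1', 'n2']),
        ('linear_regression', LinearRegression()),
    ])
    broken.fit(X, y)
except ValueError as exc:
    error_message = str(exc)

# Prints the ValueError text raised while building or fitting the pipeline.
print(error_message)
```

Dropping the third element makes the step valid again, but then the scaler is applied to every column, which is why a ColumnTransformer is needed.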

In your case, you could do something like this:

import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer

# Fake data.
train_data = pd.DataFrame({'n1': range(10), 'n2': range(10)})
train_data['c1'] = 0
# Use .loc instead of chained indexing, which pandas may not write back to the DataFrame.
train_data.loc[5:, 'c1'] = 1
y_train = [0]*10
y_train[5:] = [1]*5

# Here I assumed you are using a DataFrame. If not, use integer indices instead of column names.
num_cols = ['n1', 'n2']
cat_cols = ['c1']


# Pipeline to transform the numerical features.
numerical_transformer = Pipeline([('scaler', StandardScaler()),
                                  ('polynom', PolynomialFeatures(3))])

# Apply the numerical transformer only on the numerical columns.
# Separately, apply the OneHotEncoder to the categorical columns.
ct = ColumnTransformer([('num_transformer', numerical_transformer, num_cols),
                        ('encoder', OneHotEncoder(), cat_cols)])

# Main pipeline for fitting.
pipeline = Pipeline([
                   ('column_transformer', ct),
                   ('linear_regression', LinearRegression() )
])

pipeline.fit(train_data, y_train)
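If your data is a NumPy array rather than a DataFrame, the same idea should work with integer positions instead of column names. The following is a sketch under that assumption; the array layout (numerical features in columns 0 and 1, the categorical one in column 2) and the names X_train, num_idx, cat_idx and array_pipeline are made up for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Same fake data as a plain array:
# columns 0 and 1 are numerical, column 2 is categorical.
X_train = np.column_stack([np.arange(10), np.arange(10), [0] * 5 + [1] * 5])
y_train = [0] * 5 + [1] * 5

num_idx = [0, 1]  # column positions instead of column names
cat_idx = [2]

numerical_transformer = Pipeline([
    ('scaler', StandardScaler()),
    ('polynom', PolynomialFeatures(3)),
])
ct = ColumnTransformer([
    ('num_transformer', numerical_transformer, num_idx),
    ('encoder', OneHotEncoder(), cat_idx),
])
array_pipeline = Pipeline([
    ('column_transformer', ct),
    ('linear_regression', LinearRegression()),
])

array_pipeline.fit(X_train, y_train)
predictions = array_pipeline.predict(X_train)
```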

Schematically, the layout of your pipeline would be like this:

(Image: num_cols go through scaler and then polynom, cat_cols go through encoder, and both outputs feed linear_regression.)
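You can also check the layout from code. This sketch rebuilds the pipeline from the answer so that it runs on its own, lists which columns each transformer receives, and counts the features that reach the regression. The variables layout and n_features are just names chosen for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Same fake data as in the answer.
train_data = pd.DataFrame({'n1': range(10), 'n2': range(10),
                           'c1': [0] * 5 + [1] * 5})
y_train = [0] * 5 + [1] * 5

numerical_transformer = Pipeline([
    ('scaler', StandardScaler()),
    ('polynom', PolynomialFeatures(3)),
])
ct = ColumnTransformer([
    ('num_transformer', numerical_transformer, ['n1', 'n2']),
    ('encoder', OneHotEncoder(), ['c1']),
])
pipeline = Pipeline([
    ('column_transformer', ct),
    ('linear_regression', LinearRegression()),
])
pipeline.fit(train_data, y_train)

# Each ColumnTransformer entry is a (name, transformer, columns) triple.
layout = [(name, cols) for name, _, cols in ct.transformers]
for name, cols in layout:
    print(name, '<-', cols)

# Number of columns handed to LinearRegression:
# 10 polynomial terms (degree 3, 2 inputs, bias included) + 2 one-hot columns.
transformed = pipeline.named_steps['column_transformer'].transform(train_data)
n_features = transformed.shape[1]
print(n_features)
```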

Upvotes: 4
