natnay

Reputation: 490

Making a ColumnTransformer with numeric, categorical, and text pipeline

I'm trying to build a pipeline that handles numeric, categorical, and text variables. I want the transformed data written to a new dataframe before I run the classifier, but I'm getting the following error:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2499 and the array at index 2 has size 1.

Note that 2499 is the number of rows in my training data. If I remove the text_preprocessing part of the pipeline, my code works. Any ideas how I can get this to work? Thanks!

# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
    ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
]
)

# Numeric pipeline
numeric_preprocessing = Pipeline(
[
     ('Imputation', SimpleImputer(strategy='mean')),
     ('Scaling', StandardScaler())
]
)

text_preprocessing = Pipeline(
[
     ('Text',TfidfVectorizer())       
]
)

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
     (numeric_features, numeric_preprocessing),
     (categorical_features, categorical_preprocessing),
     (text_features,text_preprocessing),
)

# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)

test = pipeline.fit_transform(x_train)

Upvotes: 4

Views: 1321

Answers (1)

Venkatachalam

Reputation: 16966

I think you tried swapping the features and pipelines in make_column_transformer but didn't change them back when you posted the question.

Assuming you had them in the right order (estimator, column/s), this error occurs when a vectorizer is given a list of column names in ColumnTransformer. All the text vectorizers in sklearn take only 1D data (an iterable of strings, such as a pd.Series), so they cannot be applied to several columns at once. A list selector, even one holding a single name, hands the transformer a 2D DataFrame, whereas a plain string selects the column as a 1D Series.
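To see where the "size 1" in the error comes from, here is a minimal sketch that calls TfidfVectorizer directly, outside any ColumnTransformer. The tiny `df` is made up for illustration. It assumes the usual pandas behaviour that iterating over a DataFrame yields its column labels, so the 2D input is treated as a single document.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data only; not the asker's dataset.
df = pd.DataFrame({'summary': ['Great performance',
                               'fantastic performance',
                               'Could have been better']})

# Selecting with a string gives a 1D Series: one document per row.
one_d = TfidfVectorizer().fit_transform(df['summary'])

# Selecting with a list gives a 2D DataFrame. Iterating over it yields the
# column labels, so the vectorizer sees one document: the string 'summary'.
two_d = TfidfVectorizer().fit_transform(df[['summary']])

print(one_d.shape)  # one row per document
print(two_d.shape)  # a single row, built from the column name
```

The single-row result cannot be stacked next to the other transformers' outputs, which is what the concatenation error is reporting.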

Example:

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

x_train = pd.DataFrame({'fruit': ['apple','orange', np.nan],
                        'score': [np.nan, 12, 98],
                        'summary': ['Great performance', 
                                    'fantastic performance',
                                    'Could have been better']}
                        )

# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
    ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
]
)

# Numeric pipeline
numeric_preprocessing = Pipeline(
[
     ('Imputation', SimpleImputer(strategy='mean')),
     ('Scaling', StandardScaler())
]
)

text_preprocessing = Pipeline(
[
     ('Text',TfidfVectorizer())       
]
)

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
     (numeric_preprocessing, ['score']),
     (categorical_preprocessing, ['fruit']),
     (text_preprocessing, 'summary'),
)

# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)

test = pipeline.fit_transform(x_train)

If I change

    (text_preprocessing, 'summary'),

to

    (text_preprocessing, ['summary']),

it throws a

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 3 and the array at index 2 has size 1
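If there are several text columns, one possible workaround is to give each column its own vectorizer and select each with a plain string. This is only a sketch: the `title` column and the `x_text` frame are hypothetical and not part of the question.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# 'title' is a hypothetical second text column, added for illustration.
x_text = pd.DataFrame({'summary': ['Great performance',
                                   'fantastic performance',
                                   'Could have been better'],
                       'title': ['Good', 'Excellent', 'Average']})

# One vectorizer per text column, each selected by a string (1D input).
preprocessing = make_column_transformer(
    (TfidfVectorizer(), 'summary'),
    (TfidfVectorizer(), 'title'),
)

features = preprocessing.fit_transform(x_text)
print(features.shape)  # the row count matches the number of input rows
```

Each vectorizer learns its own vocabulary, so the same word appearing in both columns ends up as two separate features.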

Upvotes: 5
