Reputation: 490
I'm trying to build a pipeline that handles numeric, categorical, and text variables. I want the preprocessed data written to a new dataframe before I run the classifier. I'm getting the following error:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2499 and the array at index 2 has size 1
Note that 2499 is the size of my training data. If I remove the text_preprocessing part of the pipeline, my code works. Any ideas how I can get this to work? Thanks!
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)
# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer())
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (numeric_features, numeric_preprocessing),
    (categorical_features, categorical_preprocessing),
    (text_features, text_preprocessing),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
test = pipeline.fit_transform(x_train)
Upvotes: 4
Views: 1321
Reputation: 16966
I think you tried swapping the features and pipelines in make_column_transformer but didn't change them back when you posted the question.
Assuming you had them in the right order (estimator, column/s), this error occurs when a vectorizer is given a list of column names in ColumnTransformer. All the vectorizers in sklearn take only 1D data (an iterator or a pd.Series), so they cannot be applied to multiple columns as such. A list selector passes a 2D DataFrame even when it names a single column, whereas a plain string selects that column as a 1D Series.
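To make the 1D requirement concrete, here is a minimal sketch that calls TfidfVectorizer directly, outside any ColumnTransformer. The small frame `df` is made up for illustration. It compares a 1D Series with a single-column 2D DataFrame, which is what a list selector would hand to the vectorizer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data only, not the asker's dataset.
df = pd.DataFrame({'summary': ['Great performance',
                               'fantastic performance',
                               'Could have been better']})

# 1D input: each row of the Series is treated as one document.
rows_from_series = TfidfVectorizer().fit_transform(df['summary']).shape[0]

# 2D input: iterating over a DataFrame yields its column names,
# so the vectorizer sees a single "document", the string 'summary'.
rows_from_frame = TfidfVectorizer().fit_transform(df[['summary']]).shape[0]

print(rows_from_series, rows_from_frame)
```

The Series gives one output row per input row, while the DataFrame collapses to a single row. That single row is what ColumnTransformer later fails to concatenate with the other blocks.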
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
x_train = pd.DataFrame({'fruit': ['apple', 'orange', np.nan],
                        'score': [np.nan, 12, 98],
                        'summary': ['Great performance',
                                    'fantastic performance',
                                    'Could have been better']}
                       )
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)
# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer())
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (numeric_preprocessing, ['score']),
    (categorical_preprocessing, ['fruit']),
    (text_preprocessing, 'summary'),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
test = pipeline.fit_transform(x_train)
If I change
(text_preprocessing, 'summary'),
to
(text_preprocessing, ['summary']),
it throws a
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 3 and the array at index 2 has size 1
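If there is more than one text column, one possible workaround (a sketch, not necessarily the only approach) is to give each text column its own vectorizer, each selected with a plain string so it receives 1D data. The 'title' column and the frame `df` below are hypothetical and not part of the question.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical frame with two text columns; 'title' is made up for this sketch.
df = pd.DataFrame({
    'title': ['Good show', 'Bad show', 'Average show'],
    'summary': ['Great performance',
                'fantastic performance',
                'Could have been better'],
})

# One vectorizer per text column, each selected by a string (1D Series),
# rather than one vectorizer given a list of column names.
text_preprocessing = make_column_transformer(
    (TfidfVectorizer(), 'title'),
    (TfidfVectorizer(), 'summary'),
)

features = text_preprocessing.fit_transform(df)
print(features.shape)
```

Each vectorizer learns its own vocabulary, and the resulting blocks are stacked side by side with one row per input row.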
Upvotes: 5