Reputation: 65
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])
tweet_text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])
numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])
preprocessor = ColumnTransformer(transformers=[
    # (name, transformer, column(s))
    ('tweet', tweet_text_transformer, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])
X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)
I don't understand where this error is coming from:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 2
Upvotes: 6
Views: 2446
Reputation: 21
I implemented your solution of converting the sparse matrix to an array and it fixed the error. However, when I call predict it raises another error:
model = pipeline.fit(X_train,y_train)
y_pred = model.predict(X_test)
It gives me this error:
ValueError: X has 574 features per sample; expecting 493
My understanding is that in this case it is not using the trained vectorizer model, but trains a new one on the X_test dataset. How can I fix that? I don't know.
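For example (toy texts made up just to illustrate what I mean), refitting a CountVectorizer on different documents learns a different vocabulary size, which is exactly the kind of feature-count mismatch shown in the error above:

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["great phone battery", "really bad customer service"]
test_texts = ["completely different words here"]

# Fit once on the training texts: this vocabulary size is what the classifier expects.
vect = CountVectorizer().fit(train_texts)
print(len(vect.vocabulary_))   # 7 features learned from the training texts

# Refitting on the test texts (what happens when transform calls fit_transform)
# learns a different vocabulary, so the feature count no longer matches.
refit = CountVectorizer().fit(test_texts)
print(len(refit.vocabulary_))  # 4 features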
NOTE: the class-based solution also needs an import statement for both BaseEstimator and TransformerMixin.
To fix this problem, use FunctionTransformer instead of defining a class:
from sklearn.preprocessing import FunctionTransformer
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)
TweetTextProcessor = Pipeline(steps=[
    # collapse the (n, 1) DataFrame selected by ColumnTransformer into a 1-D Series of documents
    ("squeeze", FunctionTransformer(lambda x: x.squeeze())),
    ("vect", CountVectorizer(**vectorizer_params)),
    ("tfidf", TfidfTransformer()),
    # densify the sparse tf-idf matrix so it stacks cleanly with the numeric features
    ("toarray", FunctionTransformer(lambda x: x.toarray())),
])
numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])
preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])
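As a quick sanity check (the X_test rows below are made up, and I leave out the min_df/max_df pruning so the two-row toy data isn't filtered away), fitting and then predicting with this layout works, because the vectoriser fitted inside the pipeline during fit is the one reused by predict:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.svm import LinearSVC

# Same structure as above, but with a plain CountVectorizer so the toy data survives.
text_pipeline = Pipeline(steps=[
    ("squeeze", FunctionTransformer(lambda x: x.squeeze())),
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("toarray", FunctionTransformer(lambda x: x.toarray())),
])
preprocessor = ColumnTransformer(transformers=[
    ("tweet", text_pipeline, ["Text field"]),
    ("numeric", Pipeline(steps=[("scaler", MinMaxScaler())]), ["Num1", "Num2", "Num3"]),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LinearSVC()),
])

train = pd.DataFrame(
    [[1, 3, 4, "text", "pos"], [9, 3, 6, "text more", "neg"]],
    columns=["Num1", "Num2", "Num3", "Text field", "Class"],
)
pipeline.fit(train.loc[:, "Num1":"Text field"], train["Class"])

# Hypothetical unseen rows: the vectoriser fitted during fit() is reused here,
# so the feature count matches and predict() no longer raises a ValueError.
X_test = pd.DataFrame(
    [[2, 1, 5, "more text"], [7, 2, 3, "something new"]],
    columns=["Num1", "Num2", "Num3", "Text field"],
)
print(pipeline.predict(X_test))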
Upvotes: 2
Reputation: 51
The issue is in the preprocessor pipeline. The way this pipeline works is that the output of tweet_text_transformer and the output of numeric_transformer are stacked horizontally. For this to happen successfully, both outputs must have the same number of rows (i.e. the same number of elements along axis 0 / dimension 0).

But when the above pipeline is executed, the text branch does not produce one row per tweet. Because the column is passed to the ColumnTransformer as the list ['Text field'], the text pipeline receives a one-column DataFrame instead of a 1-D sequence of documents, and CountVectorizer then treats that input as a single document rather than two, so its output has only one row. That one-row output cannot be stacked next to the output of numeric_transformer, which has two rows (one per sample), hence the dimension-mismatch error in the question.
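A minimal way to see the mismatch with the toy data (the shapes below are what I get; in the first case the single "document" is the column label itself):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Text field': ['text', 'text more']})

# Selecting with a list yields a (2, 1) DataFrame; iterating a DataFrame gives its
# column labels, so CountVectorizer sees one "document" and returns a single row.
print(CountVectorizer().fit_transform(df[['Text field']]).shape)  # (1, 2)

# Selecting the column (or squeezing) yields a 1-D Series of documents, so the output
# has one row per tweet and lines up with the two rows from numeric_transformer.
print(CountVectorizer().fit_transform(df['Text field']).shape)    # (2, 2)

With that in mind, wrapping the text steps in a custom transformer that squeezes the selected column back into a 1-D array (and densifies the tf-idf output with .toarray() so it stacks cleanly with the scaled numeric columns) fixes the problem: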
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])
class TweetTextProcessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tweet_text_transformer = Pipeline(steps=[
            ('count_vectoriser', CountVectorizer()),
            ('tfidf', TfidfTransformer())
        ])

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return self.tweet_text_transformer.fit_transform(X.squeeze()).toarray()

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])
preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor(), ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])
X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)
The above code should work. Let me know if it doesn't, or if the explanation was not clear (hopefully it is).
Upvotes: 5