Sklearn: Found input variables with inconsistent numbers of samples:

Question

I have built a model.

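    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # One-hot encode the categorical column Var1
    est1_pre = ColumnTransformer(
        [('catONEHOT', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['Var1'])],
        remainder='drop')
    # TF-IDF features from the text column Var2
    est2_pre = ColumnTransformer(
        [('BOW', TfidfVectorizer(ngram_range=(1, 3), max_features=1000), ['Var2'])],
        remainder='drop')

    # `alternative` is not defined in this post; it is used as the classifier step
    m1 = Pipeline([('FeaturePreprocessing', est1_pre), ('clf', alternative)])
    m2 = Pipeline([('FeaturePreprocessing', est2_pre), ('clf', alternative)])
    model_combo = StackingClassifier(
        estimators=[('cate', m1), ('text', m2)],
        final_estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    )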

I can successfully fit and predict using m1 and m2. However, with the combined model_combo, any attempt to call .fit/.predict results in ValueError: Found input variables with inconsistent numbers of samples:

    model_fitted = model_combo.fit(x_train, y_train)

x_train contains Var1 and Var2. How can I fit model_combo?

Ben Reiniger · Accepted Answer

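The problem is that sklearn's text preprocessors (TfidfVectorizer in this case) operate on one-dimensional data, not two-dimensional data as most other preprocessors do. Selecting the column with a list, ['Var2'], passes the vectorizer a two-dimensional, single-column frame. The vectorizer iterates over that frame's columns rather than its rows, so it sees only one "document", and its output has one row instead of one row per sample. This can be fixed in the ColumnTransformer by specifying the column as a plain string rather than a list: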

    est2_pre = ColumnTransformer(
        [('BOW', TfidfVectorizer(ngram_range=(1, 3), max_features=1000), 'Var2')],
        remainder='drop')
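
To see the difference for yourself, here is a minimal sketch that compares the two column selectors on a small made-up DataFrame. The toy data and the names X, as_list and as_str are assumptions for illustration only, not part of the original question. The list form should produce a single row, while the string form should produce one row per sample.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for x_train
X = pd.DataFrame({
    "Var1": ["a", "b", "a"],
    "Var2": ["red apple", "green pear", "red pear"],
})

# List selector: the vectorizer receives a 2-D, single-column DataFrame
as_list = ColumnTransformer(
    [("BOW", TfidfVectorizer(), ["Var2"])], remainder="drop")

# String selector: the vectorizer receives the 1-D column of documents
as_str = ColumnTransformer(
    [("BOW", TfidfVectorizer(), "Var2")], remainder="drop")

shape_list = as_list.fit_transform(X).shape
shape_str = as_str.fit_transform(X).shape

print("list selector:  ", shape_list)  # row count no longer matches len(X)
print("string selector:", shape_str)   # one row per sample
```

A row count that differs from len(y_train) is exactly what triggers the "inconsistent numbers of samples" error once the features reach the classifier.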
