
Reputation: 5367

Sklearn: Found input variables with inconsistent numbers of samples:

I have built a stacking model from two pipelines, each with its own ColumnTransformer:

    est1_pre = ColumnTransformer(
        [('catONEHOT', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['Var1'])],
        remainder='drop')
    est2_pre = ColumnTransformer(
        [('BOW', TfidfVectorizer(ngram_range=(1, 3), max_features=1000), ['Var2'])],
        remainder='drop')


    m1 = Pipeline([('FeaturePreprocessing', est1_pre),
                   ('clf', alternative)])
    m2 = Pipeline([('FeaturePreprocessing', est2_pre),
                   ('clf', alternative)])
    model_combo = StackingClassifier(
        estimators=[('cate', m1), ('text', m2)],
        final_estimator=RandomForestClassifier(n_estimators=10,
                                               random_state=42)
    )

I can successfully fit and predict with m1 and m2 individually. However, any attempt to call .fit or .predict on the combined model_combo raises ValueError: Found input variables with inconsistent numbers of samples:

    model_fitted=model_combo.fit(x_train,y_train)

x_train contains both Var1 and Var2. How can I fit model_combo?

Upvotes: 0

Views: 531

Answers (1)

Ben Reiniger

Reputation: 12748

The problem is that sklearn's text preprocessors (TfidfVectorizer in this case) operate on one-dimensional data, not two-dimensional data like most other preprocessors. When the column is given as a list, ColumnTransformer hands the vectorizer a two-dimensional, one-column frame. The vectorizer then iterates over that frame's columns rather than its rows, so it sees only one "document". This can be fixed in the ColumnTransformer by specifying the column as a plain string rather than a list:

    est2_pre = ColumnTransformer(
        [('BOW', TfidfVectorizer(ngram_range=(1, 3), max_features=1000), 'Var2')],
        remainder='drop')
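
To see the difference concretely, here is a minimal sketch on made-up data. The toy x_train and y_train, and the use of LogisticRegression in place of the question's undefined `alternative`, are assumptions for illustration only. It first compares how many "documents" the vectorizer sees for a one-column DataFrame versus a Series, then fits the stacked model with the string column selector.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the question's x_train / y_train.
x_train = pd.DataFrame({
    "Var1": ["a", "b"] * 10,
    "Var2": ["good fast service", "bad slow service"] * 10,
})
y_train = pd.Series([1, 0] * 10)

# A one-column DataFrame iterates over its column labels,
# so the vectorizer sees a single document (the label "Var2").
rows_2d = TfidfVectorizer().fit_transform(x_train[["Var2"]]).shape[0]
# A Series iterates over its values: one document per row.
rows_1d = TfidfVectorizer().fit_transform(x_train["Var2"]).shape[0]
print(rows_2d, rows_1d)

est1_pre = ColumnTransformer(
    [("catONEHOT", OneHotEncoder(dtype=int, handle_unknown="ignore"), ["Var1"])],
    remainder="drop")
# String selector: ColumnTransformer passes a 1-D column to the vectorizer.
est2_pre = ColumnTransformer(
    [("BOW", TfidfVectorizer(ngram_range=(1, 3), max_features=1000), "Var2")],
    remainder="drop")

# LogisticRegression is an assumed stand-in for `alternative`.
m1 = Pipeline([("FeaturePreprocessing", est1_pre), ("clf", LogisticRegression())])
m2 = Pipeline([("FeaturePreprocessing", est2_pre), ("clf", LogisticRegression())])

model_combo = StackingClassifier(
    estimators=[("cate", m1), ("text", m2)],
    final_estimator=RandomForestClassifier(n_estimators=10, random_state=42),
)
model_fitted = model_combo.fit(x_train, y_train)
preds = model_fitted.predict(x_train)
```

The one-hot encoder keeps its list selector because it expects two-dimensional input; only the text vectorizer needs the bare string.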

Upvotes: 1
