How to use the imbalanced library with sklearn pipeline?

Question

I am trying to solve a text classification problem. I want to create baseline model using MultinomialNB

my data is highly imbalnced for few categories, hence decided to use the imbalanced library with sklearn pipeline and referring the tutorial.

The model is failing and giving error after introducing the two stages in pipeline as suggested in docs.

from imblearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
                                         ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
                         ('tfidf', TfidfTransformer(use_idf= True)),\
                          ('enn', EditedNearestNeighbours()),\
                          ('renn', RepeatedEditedNearestNeighbours()),\
                          ('clf-gnb',  MultinomialNB()),])

Error:

TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

Can someone please help here. I am also open to use different way of (Boosting/SMOTE) implementation as well ?

How to use the imbalanced library with sklearn pipeline?

Answers (1)

Related Questions