Shivam Agrawal

Reputation: 2103

How to use the imbalanced library with sklearn pipeline?

I am trying to solve a text classification problem and want to build a baseline model using MultinomialNB.

My data is highly imbalanced for a few categories, so I decided to use the imbalanced-learn library (imblearn) with an sklearn pipeline, following the tutorial.

After adding the two resampling stages to the pipeline as suggested in the docs, the model fails with an error.

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
                                         ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
                         ('tfidf', TfidfTransformer(use_idf= True)),\
                          ('enn', EditedNearestNeighbours()),\
                          ('renn', RepeatedEditedNearestNeighbours()),\
                          ('clf-gnb',  MultinomialNB()),])

Error:

TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

Can someone please help? I am also open to a different approach, such as boosting or SMOTE.

Upvotes: 2

Views: 2510

Answers (1)

CoMartel

Reputation: 3591

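make_pipeline from imblearn, like the one in sklearn, does not take named (name, estimator) tuples: it takes the estimators themselves as positional arguments and names the steps automatically. From the imblearn documentation: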

*steps : list of estimators.

You should modify your code to :

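# make_pipeline_imb is assumed to be imblearn's make_pipeline under an alias
from imblearn.pipeline import make_pipeline as make_pipeline_imb

# Pass the estimators directly, without (name, estimator) tuples
pipe = make_pipeline_imb(
    CountVectorizer(max_features=100000, ngram_range=(1, 2),
                    tokenizer=tokenize_and_stem),
    TfidfTransformer(use_idf=True),
    EditedNearestNeighbours(),
    RepeatedEditedNearestNeighbours(),
    MultinomialNB())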

Upvotes: 2
