Reputation: 2103
I am trying to solve a text classification problem. I want to create baseline model using MultinomialNB
my data is highly imbalnced for few categories, hence decided to use the imbalanced library with sklearn pipeline and referring the tutorial.
The model is failing and giving error after introducing the two stages in pipeline as suggested in docs.
from imblearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
('tfidf', TfidfTransformer(use_idf= True)),\
('enn', EditedNearestNeighbours()),\
('renn', RepeatedEditedNearestNeighbours()),\
('clf-gnb', MultinomialNB()),])
Error:
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help here. I am also open to use different way of (Boosting/SMOTE) implementation as well ?
Upvotes: 2
Views: 2510
Reputation: 3591
It seems that the pipeline from ìmblearn doesn't support naming like the one in sklearn. From imblearn documentation :
*steps : list of estimators.
You should modify your code to :
pipe = make_pipeline_imb( CountVectorizer(max_features=100000,\
ngram_range= (1, 2),tokenizer=tokenize_and_stem),\
TfidfTransformer(use_idf= True),\
EditedNearestNeighbours(),\
RepeatedEditedNearestNeighbours(),\
MultinomialNB())
Upvotes: 2