Vineet

Reputation: 1572

SMOTE, Oversampling on text classification in Python

I am doing text classification and I have very imbalanced data, like:

Category | Total Records
Cate1    | 950
Cate2    |  40
Cate3    |  10

Now I want to oversample Cate2 and Cate3 so that they each have at least 400-500 records. I prefer SMOTE over random oversampling. Code:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(fewRecords['text'],
                                   fewRecords['category'])

sm = SMOTE(random_state=12, ratio=1.0)
x_train_res, y_train_res = sm.fit_sample(X_train, y_train)

It does not work, as SMOTE can't generate synthetic text samples. Now, when I convert the text into vectors, like:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(fewRecords['category'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(X_train)
ytrain_train = count_vect.transform(y_train)

I am not sure if this is the right approach, and how do I convert the vectors back to real text when I want to predict the real category after classification?

Upvotes: 6

Views: 17527

Answers (2)

AviS

Reputation: 622

I know this question is over 2 years old, and I hope you found a resolution. In case you are still interested, this can be done easily with imblearn pipelines.

I will proceed under the assumption that you will use a scikit-learn-compatible estimator to perform the classification. Let's say multinomial Naive Bayes.

Please note how I import Pipeline from imblearn and not sklearn:

from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

Import SMOTE as you've done in your code

from imblearn.over_sampling import SMOTE

Do the train-test split as you've done in your code

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    fewRecords['text'], fewRecords['category'],
    stratify=fewRecords['category'], random_state=0
)

Create a pipeline with SMOTE as one of the components

textclassifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('smote', SMOTE(random_state=12)),
    ('mnb', MultinomialNB(alpha=0.1))
])

Train the classifier on training data

textclassifier.fit(X_train, y_train)

Then you can use this classifier for any task, including evaluating the classifier itself, predicting new observations, etc.

e.g. predicting a new sample

textclassifier.predict(['sample text'])

would return a predicted category.
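For example, evaluating on the held-out test set. A minimal sketch, assuming the X_test/y_test split created above:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1; with imbalanced classes these are
# more informative than plain accuracy.
y_pred = textclassifier.predict(X_test)
print(classification_report(y_test, y_pred))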

For a more accurate model, try word vectors as features or, more conveniently, perform hyperparameter optimization on the pipeline.
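A minimal grid-search sketch over the pipeline above (the parameter grid and cv=3 are illustrative assumptions, not tuned recommendations; with classes as small as Cate3 you may also need to lower SMOTE's k_neighbors):

from sklearn.model_selection import GridSearchCV

# Parameters are addressed by pipeline step name, e.g. 'mnb__alpha'.
# Because SMOTE lives inside the imblearn Pipeline, resampling happens
# only on each training fold, never on the validation fold.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'mnb__alpha': [0.01, 0.1, 1.0],
}
search = GridSearchCV(textclassifier, param_grid, scoring='f1_macro', cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)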

Upvotes: 7

allenyllee

Reputation: 1074

You first need to transform each text document into a fixed-length numerical vector; then you can do anything you want. Try LDA or Doc2Vec.
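A minimal Doc2Vec sketch with gensim (vector_size and epochs are illustrative assumptions; fewRecords follows the question's data):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from imblearn.over_sampling import SMOTE

# Wrap each document as a TaggedDocument for gensim.
docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(fewRecords['text'])]

# Train a small Doc2Vec model and infer one fixed-length vector per document.
model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
X = [model.infer_vector(doc.words) for doc in docs]

# The data is numeric now, so SMOTE can interpolate synthetic samples
# (fit_resample is the current imblearn API, replacing fit_sample).
X_res, y_res = SMOTE(random_state=12).fit_resample(X, fewRecords['category'])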

Upvotes: 3
