Reputation: 1572
I am doing text classification and I have very imbalanced data, e.g.
Category | Total Records
Cate1 | 950
Cate2 | 40
Cate3 | 10
Now I want to oversample Cate2 and Cate3 so that each has at least 400-500 records. I prefer SMOTE over random oversampling. Code:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(
    fewRecords['text'], fewRecords['category'])
sm = SMOTE(random_state=12, ratio = 1.0)
x_train_res, y_train_res = sm.fit_sample(X_train, y_train)
It does not work, as SMOTE can't generate synthetic samples from raw text. Now when I convert the text into vectors like
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(fewRecords['category'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(X_train)
ytrain_train = count_vect.transform(y_train)
I am not sure if this is the right approach, and I don't know how to get back from vectors to real categories when I want to predict the category after classification.
Upvotes: 6
Views: 17527
Reputation: 622
I know this question is over 2 years old and I hope you found a resolution. In case you are still interested, this can easily be done with imblearn pipelines.
I will proceed under the assumption that you will use an sklearn-compatible estimator to perform the classification. Let's say Multinomial Naive Bayes.
Please note how I import Pipeline from imblearn, not sklearn:
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
Import SMOTE as you've done in your code:
from imblearn.over_sampling import SMOTE
Do the train-test split as you've done in your code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    fewRecords['text'], fewRecords['category'],
    stratify=fewRecords['category'], random_state=0
)
Create a pipeline with SMOTE as one of the components:
textclassifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('smote', SMOTE(random_state=12)),
    ('mnb', MultinomialNB(alpha=0.1))
])
Train the classifier on the training data:
textclassifier.fit(X_train, y_train)
Then you can use this classifier for any task, including evaluating the classifier itself, predicting new observations, etc.
E.g. predicting a new sample:
textclassifier.predict(['sample text'])
would return a predicted category.
For a more accurate model, try word vectors as features or, more conveniently, perform hyperparameter optimization on the pipeline.
Upvotes: 7
Reputation: 1074
You first need to transform each text document into a fixed-length numerical vector; then you can do anything you want. Try LDA or Doc2Vec.
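The LDA route can be sketched with sklearn's LatentDirichletAllocation (Doc2Vec would require gensim instead); the four documents and the choice of 2 topics are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats purr and sleep", "dogs bark and fetch",
        "stocks rise on earnings", "markets fall on rate fears"]

# Bag-of-words counts, then a 2-topic LDA model
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each document becomes a fixed-length topic-distribution vector
doc_vectors = lda.fit_transform(counts)
print(doc_vectors.shape)   # (n_docs, n_components)
```

Each row is a probability distribution over topics, so every document, whatever its length, maps to the same fixed-length vector that SMOTE or any downstream classifier can consume.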
Upvotes: 3