Reputation: 718
I am trying to build a Multinomial Naive Bayes (MNB) classifier for sentiment analysis. My dataset consists of text and labels in the following structure, where the labels range from 1 to 5. I am using the Hugging Face emotions dataset.
feature label
"I feel good" 1
I was able to do this using only my train dataset and sklearn's train_test_split function. But when I try to do it with my separate validation dataset, I get
ValueError: X has 3427 features, but MultinomialNB is expecting 10052 features as input.
on the last line of the following code (the predict call):
cv = CountVectorizer(stop_words='english')
val_ppd_df = cv.fit_transform(val_df["lemmatized"])
val_labels = np.array(val_df['label'])
train_labels = np.array(train_df['label'])
mnb = MultinomialNB()
mnb.fit(train_ppd_df,train_labels)
predictions_NB = mnb.predict(val_ppd_df)
I apply every preprocessing step (tokenization, stemming, lemmatization) to my validation dataset as well, but instead of calling train_test_split I just take the labels of the train and validation datasets separately. I compared what comes out of train_test_split with what val_ppd_df contains and noticed that their shapes are different:
<16000x10052 sparse matrix of type '<class 'numpy.int64'>'
with 128627 stored elements in Compressed Sparse Row format>
<2000x3427 sparse matrix of type '<class 'numpy.int64'>'
with 15853 stored elements in Compressed Sparse Row format>
How can I handle this difference? Every example on the internet uses train_test_split, and my code works fine with it, but I want to evaluate first on a validation dataset and then on a separate test dataset, not only on a split of the train dataset.
Upvotes: 2
Views: 403
Reputation: 4273
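fit_transform should only be applied to the training data. For validation and testing, apply the transform method instead. Calling fit_transform on the validation set makes CountVectorizer learn a new vocabulary from that set alone, which is why it produces 3427 columns while the model was trained on 10052.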
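To see the mismatch in isolation, here is a minimal sketch on a toy corpus. The sentences and variable names are made up for illustration and are not taken from the question's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, standing in for the train and validation texts
train_texts = ["i feel good today", "i feel sad and tired", "what a happy surprise"]
valid_texts = ["i feel happy", "so tired today"]

# Fit the vocabulary on the training texts only
cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)

# Wrong: a second fit learns a separate vocabulary from the validation texts,
# so the number of columns no longer matches the training matrix
X_valid_wrong = CountVectorizer().fit_transform(valid_texts)

# Right: reuse the fitted vectorizer, which maps onto the training vocabulary;
# words unseen during training are simply ignored
X_valid = cv.transform(valid_texts)

print("train:", X_train.shape)
print("valid (refit):", X_valid_wrong.shape)
print("valid (transform):", X_valid.shape)
```

Only the matrix produced by transform has the same number of columns as X_train, so only that one can be passed to a model fitted on X_train.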
Minimal reproducible example with the Hugging Face SetFit/emotion dataset:
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load emotions dataset
emotions = load_dataset("SetFit/emotion")
train = emotions['train']
validation = emotions['validation']
# Create X_train using `cv.fit_transform`
cv = CountVectorizer(stop_words="english")
X_train = cv.fit_transform(train["text"])
# Fit Multinomial Naive Bayes
nb = MultinomialNB().fit(X_train, train["label"])
# Estimate performance on the validation set
X_valid = cv.transform(validation["text"])
print(nb.score(X_valid, validation["label"]))
# 0.797
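One way to make this mistake harder to repeat is to bundle the vectorizer and the classifier in a scikit-learn Pipeline, so that fit only ever sees the training data and every later call only transforms. The following is a sketch on hypothetical toy data (the texts, labels, and the separate test set are invented for illustration), not the exact setup from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the train / validation / test sets
train_texts = ["so happy and joyful", "happy cheerful morning",
               "sad and miserable", "miserable gloomy evening"]
train_labels = [1, 1, 0, 0]
valid_texts = ["joyful morning", "gloomy and sad"]
valid_labels = [1, 0]
test_texts = ["cheerful and happy", "miserable evening"]
test_labels = [1, 0]

# The pipeline calls fit_transform on the vectorizer during fit,
# and only transform during predict / score
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Raw text goes in; the training vocabulary is reused automatically
valid_acc = model.score(valid_texts, valid_labels)
test_acc = model.score(test_texts, test_labels)
print(valid_acc, test_acc)
```

With this layout the validation and test sets are passed in as raw text, and there is no separate vectorizer object to accidentally refit.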
Upvotes: 0