dense8

Reputation: 718

Multinomial Naive Bayes can't use validation dataset because of ValueError but can use sklearn train_test_split

I am trying to build a Multinomial Naive Bayes (MNB) classifier for sentiment analysis. I am using the Hugging Face emotions dataset, which consists of text and a label in the following structure, where labels are from 1-5.

feature                                   label
"I feel good"                             1

I was able to do it using only my train dataset and the train_test_split function of sklearn. But when I try to do it with my separate validation dataset, I get

ValueError: X has 3427 features, but MultinomialNB is expecting 10052 features as input.

on the last line of the following code (the predict call):

cv = CountVectorizer(stop_words='english')
val_ppd_df = cv.fit_transform(val_df["lemmatized"])
val_labels = np.array(val_df['label'])
train_labels = np.array(train_df['label'])
mnb = MultinomialNB()
mnb.fit(train_ppd_df,train_labels)
predictions_NB = mnb.predict(val_ppd_df)

I apply every preprocessing step (tokenization, stemming, lemmatization) to my validation dataset, but instead of calling train_test_split I just take the labels of the train and validation datasets separately. I compared what comes out of train_test_split with what val_ppd_df contains and noticed that they are different:

<16000x10052 sparse matrix of type '<class 'numpy.int64'>'
    with 128627 stored elements in Compressed Sparse Row format>
<2000x3427 sparse matrix of type '<class 'numpy.int64'>'
    with 15853 stored elements in Compressed Sparse Row format>

How can I handle this difference? Every example on the internet uses train_test_split, and my code works fine with it, but I want to evaluate first on the validation dataset and then on a different test dataset, not only on the train dataset.

Upvotes: 2

Views: 403

Answers (1)

Alexander L. Hayes

Reputation: 4273

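fit_transform should only be applied to the training data. For validation and testing, call the transform method of the same fitted vectorizer. Calling fit_transform on the validation data learns a new vocabulary from that data alone, which is why the matrix ends up with 3427 columns instead of the 10052 the model was trained on.

Minimal reproducible example with the Hugging Face SetFit/emotion dataset: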

from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load emotions dataset
emotions = load_dataset("SetFit/emotion")
train = emotions['train']
validation = emotions['validation']

# Create X_train using `cv.fit_transform`
cv = CountVectorizer(stop_words="english")
X_train = cv.fit_transform(train["text"])

# Fit Multinomial Naive Bayes
nb = MultinomialNB().fit(X_train, train["label"])

# Estimate performance on the validation set
X_valid = cv.transform(validation["text"])
print(nb.score(X_valid, validation["label"]))
# 0.797
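
One way to avoid this class of mistake, as a minimal sketch rather than the only approach, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline, so that fit learns the vocabulary once and predict only ever calls transform. The toy sentences and labels below are made up for illustration and are not taken from the emotions dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the train and validation splits
train_texts = ["i feel good", "i feel happy today", "i feel sad", "this is awful and sad"]
train_labels = [1, 1, 0, 0]
val_texts = ["feel good today", "so sad and gloomy"]

# Correct: learn the vocabulary on train, then reuse it for validation
cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)
X_val = cv.transform(val_texts)
print(X_train.shape[1] == X_val.shape[1])  # same number of feature columns

# Incorrect: refitting on validation builds a different vocabulary
X_val_wrong = CountVectorizer().fit_transform(val_texts)
print(X_val_wrong.shape[1] == X_train.shape[1])  # column counts differ

# A pipeline handles fit vs. transform automatically
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_texts, train_labels)
val_predictions = pipe.predict(val_texts)
```

Words that appear only in the validation texts are simply ignored by transform, since they have no column in the fitted vocabulary. The same fitted pipeline can then be reused unchanged on a separate test set.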

Upvotes: 0
