Multinomial Naive Bayes can't use validation dataset because of ValueError but can use sklearn train_test_split

Question

I am trying to build a Multinomial Naive Bayes (MNB) classifier for sentiment analysis. I am using the Hugging Face emotions dataset, which consists of text and a label in the following structure, where labels range from 1 to 5.

feature                                   label
"I feel good"                             1

I was able to do this using only my train dataset and sklearn's train_test_split function. But when I try to do it with my separate validation dataset, I get

ValueError: X has 3427 features, but MultinomialNB is expecting 10052 features as input.

on the last line of the following code (predict):

cv = CountVectorizer(stop_words='english')
val_ppd_df = cv.fit_transform(val_df["lemmatized"])
val_labels = np.array(val_df['label'])
train_labels = np.array(train_df['label'])
mnb = MultinomialNB()
mnb.fit(train_ppd_df,train_labels)
predictions_NB = mnb.predict(val_ppd_df)

I apply every preprocessing step (tokenization, stemming, lemmatization) to my validation dataset as well, but instead of calling train_test_split I just take the labels from the train and validation datasets separately. I compared what comes out of train_test_split with what val_ppd_df contains, and I noticed that their shapes are different:

<16000x10052 sparse matrix of type '<class 'numpy.int64'>'
    with 128627 stored elements in Compressed Sparse Row format>
<2000x3427 sparse matrix of type '<class 'numpy.int64'>'
    with 15853 stored elements in Compressed Sparse Row format>

How can I handle this difference? Every example on the internet uses train_test_split, and my code works fine with it, but I want to evaluate first on the validation dataset and then on a separate test dataset, not only on a split of the train dataset.
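A likely cause, judging from the code above: calling fit_transform on the validation text builds a second, smaller vocabulary (3427 terms) that does not line up with the 10052 columns the model was trained on. A minimal sketch of the usual pattern follows, assuming the vectorizer is fitted once on the training text and then reused with transform on the validation text. The toy sentences and labels are hypothetical stand-ins for train_df["lemmatized"], val_df["lemmatized"] and the label columns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for train_df / val_df.
train_texts = ["i feel good", "i feel happy today", "i feel sad", "i feel awful today"]
train_labels = np.array([1, 1, 0, 0])
val_texts = ["feel good today", "feel sad and lonely"]

cv = CountVectorizer(stop_words='english')

# Learn the vocabulary from the training text only.
train_ppd_df = cv.fit_transform(train_texts)

# Reuse that same vocabulary: transform, not fit_transform.
# Words unseen during training are simply ignored.
val_ppd_df = cv.transform(val_texts)

mnb = MultinomialNB()
mnb.fit(train_ppd_df, train_labels)
predictions_NB = mnb.predict(val_ppd_df)

# Both matrices now have the same number of columns.
print(train_ppd_df.shape, val_ppd_df.shape)
```

The same fitted cv can then be applied with transform to the separate test dataset, so all three matrices share one feature space.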
