seanlorenz

Reputation: 323

Dealing with negative values in sklearn MultinomialNB

I am normalizing my text input before running MultinomialNB in sklearn like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Normalizer

vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english', use_idf=True)
lsa = TruncatedSVD(n_components=100)
mnb = MultinomialNB(alpha=0.01)

train_text = vectorizer.fit_transform(raw_text_train)
train_text = lsa.fit_transform(train_text)
train_text = Normalizer(copy=False).fit_transform(train_text)

mnb.fit(train_text, train_labels)

Unfortunately, MultinomialNB does not accept the negative values created during the LSA stage. Any ideas for getting around this?

Upvotes: 19

Views: 41502

Answers (4)

yeong wee ping

Reputation: 1

I had the same issue running NB, and the sklearn.preprocessing.MinMaxScaler() suggested by gobrewers14 does indeed work. However, on the same standardized dataset it reduced the accuracy of my Decision Tree, Random Forest, and KNN models by 0.2%.
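For reference, a minimal sketch of that MinMaxScaler approach applied to the asker's pipeline (raw_text_train and train_labels are assumed from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english', use_idf=True)
lsa = TruncatedSVD(n_components=100)

train_text = vectorizer.fit_transform(raw_text_train)
train_text = lsa.fit_transform(train_text)

# Rescale each LSA component to [0, 1] so every value is non-negative
train_text = MinMaxScaler().fit_transform(train_text)

mnb = MultinomialNB(alpha=0.01)
mnb.fit(train_text, train_labels)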

Upvotes: 0

Rakshit Sinha

Reputation: 11

Try creating a pipeline with normalization as the first step and model fitting as the second step. MinMaxScaler rescales each feature to [0, 1], which removes the negative values.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

p = Pipeline([('Normalizing', MinMaxScaler()), ('MultinomialNB', MultinomialNB())])
p.fit(X_train, y_train)

Upvotes: 0

Roaa

Reputation: 21

Try converting the matrix to a dense array before calling fit():

mnb.fit(train_text.todense(), train_labels)

Upvotes: 0

Martin Forte

Reputation: 873

I recommend that you don't use Naive Bayes with SVD or other matrix factorizations, because Naive Bayes is based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Use another classifier instead, for example RandomForest.

I tried this experiment and got these results:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Normalizer

vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english', use_idf=True)
lsa = NMF(n_components=100)
mnb = MultinomialNB(alpha=0.01)

train_text = vectorizer.fit_transform(raw_text_train)
train_text = lsa.fit_transform(train_text)
train_text = Normalizer(copy=False).fit_transform(train_text)

mnb.fit(train_text, train_labels)

This is the same case, but using NMF (non-negative matrix factorization) instead of SVD, and I got 0.04% accuracy.

Changing the classifier from MultinomialNB to RandomForest, I got 79% accuracy.

Therefore, either change the classifier or don't apply a matrix factorization.
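A minimal sketch of the RandomForest variant (assuming the same raw_text_train and train_labels as in the question; n_estimators=100 is an illustrative choice, not from the original answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier

vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english', use_idf=True)
lsa = TruncatedSVD(n_components=100)

train_text = vectorizer.fit_transform(raw_text_train)
train_text = lsa.fit_transform(train_text)
train_text = Normalizer(copy=False).fit_transform(train_text)

# RandomForest has no non-negativity requirement, so the LSA output can be used directly
rf = RandomForestClassifier(n_estimators=100)
rf.fit(train_text, train_labels)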

Upvotes: 8
