Reputation: 1307
I am classifying small texts (tweets) using Naive Bayes (MultinomialNB) in scikit-learn. My training data has 1000 features and my test data has 1200 features; let's say 500 features are common to both.
I wonder why MultinomialNB in scikit-learn does not handle unseen features and instead gives me an error:
Traceback (most recent call last):
File "/Users/osopova/Documents/00_KSU_Masters/01_2016_Spring/Twitter_project/mda_project_1/step_4.py", line 60, in <module>
predict_Y = classifiers[i].predict(test_X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 65, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 672, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
ValueError: matrices are not aligned
Upvotes: 0
Views: 534
Reputation: 66775
It does not handle unseen features because you never pass any reference that names the features. Why do you have 1200 features in one case and 1000 in the other? Probably because there were objects in the test set that were not present in training - but how is Naive Bayes supposed to figure out which of those 1200 are missing from the 1000? In this implementation (the only one possible when the input is plain arrays) it is your responsibility to remove every column that does not correspond to a column in the training set, to add columns of zeros (in the right spots) where it is the other way around, and most importantly to make sure that the i-th column in one set captures the occurrence of the same word/object as the i-th column in the other. Consequently, in your case only 500 columns can actually be used, and Naive Bayes has no information about how to find them. In the test scenario you have to provide the same 1000 features that were used during training, which here means removing the 700 columns not seen during training and adding (in the right spots!) 500 columns of zeros.
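As a rough illustration, here is a minimal sketch of that manual alignment, assuming you have kept the feature (word) name for each column; the names train_feature_names, test_feature_names and the toy matrix are hypothetical placeholders for your own data:

import numpy as np

# Hypothetical feature names: one per column of the respective matrix.
train_feature_names = ["apple", "banana", "cherry"]
test_feature_names = ["banana", "durian", "apple"]

# Toy test matrix with columns ordered as test_feature_names.
test_X = np.array([[1, 0, 2],
                   [0, 3, 1]])

# Build the aligned matrix: one column per *training* feature,
# zero-filled where the test data never saw that feature.
aligned_test_X = np.zeros((test_X.shape[0], len(train_feature_names)))
test_index = {name: j for j, name in enumerate(test_feature_names)}
for i, name in enumerate(train_feature_names):
    if name in test_index:
        aligned_test_X[:, i] = test_X[:, test_index[name]]

# aligned_test_X now has exactly the columns, in the same order,
# that the classifier was trained on; test-only columns are dropped.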
In particular, scikit-learn gives you plenty of data preprocessing utilities that do this for you (such as CountVectorizer).
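For example, a CountVectorizer fitted on the training texts and only transforming (not re-fitting) the test texts guarantees both matrices share the same columns; this is a minimal sketch with made-up texts and labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["some tweet about cats", "another tweet about dogs"]
train_labels = [0, 1]
test_texts = ["a new tweet about birds"]

vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_X = vectorizer.transform(test_texts)        # reuses it; unseen words are dropped

clf = MultinomialNB()
clf.fit(train_X, train_labels)
predictions = clf.predict(test_X)                # shapes now align, no ValueError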
Upvotes: 2