Uylenburgh

Reputation: 1307

Naive Bayes unseen features handling scikit learn

I am classifying short texts (tweets) using Naive Bayes (MultinomialNB) in scikit-learn. My training data has 1000 features, and my test data has 1200 features. Let's say 500 features are common to both the training and test data.

I wonder why MultinomialNB in scikit-learn does not handle unseen features and instead gives me an error:

Traceback (most recent call last):
  File "/Users/osopova/Documents/00_KSU_Masters/01_2016_Spring/Twitter_project/mda_project_1/step_4.py", line 60, in <module>
    predict_Y = classifiers[i].predict(test_X)
  File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 65, in predict
    jll = self._joint_log_likelihood(X)
  File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 672, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T)
  File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
    return fast_dot(a, b)
ValueError: matrices are not aligned

Upvotes: 0

Views: 534

Answers (1)

lejlot

Reputation: 66775

It does not handle unseen features because you do not pass it any reference that names the features. Why do you have 1200 features in one case and 1000 in the other? Probably because the test set contains objects (words) that were not present in the training set. But how is Naive Bayes supposed to figure out which of these 1200 columns are missing from the 1000? The classifier only sees arrays of numbers, so the two matrices have different widths and the dot product in the traceback fails.

In this implementation (which is the only possible one when the input is plain arrays) it is your duty to remove all test columns that do not correspond to a column in the training set, to add columns of zeros (in the valid spots) for training features that are absent from the test set, and, most importantly, to make sure that the i-th column in one set is the same as the i-th column in the other (that is, it captures the occurrence of the same word/object). Consequently, in your case there are just 500 columns that can actually be used, and Naive Bayes has no information about how to find them. In the test scenario you have to provide the same 1000 features that were used in training. In your case this means removing the 700 test columns not seen during training (1200 - 500) and adding, in the valid spots, 500 columns of zeros for the training features missing from the test set (1000 - 500).
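The manual alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `align_to_train` and the toy feature names are hypothetical, the matrices are dense NumPy arrays, and you are assumed to have kept the list of feature names for each matrix.

```python
import numpy as np

def align_to_train(test_X, test_features, train_features):
    """Rebuild test_X so its columns match the training layout.

    Hypothetical helper: columns unseen in training are dropped, training
    features missing from the test set become all-zero columns, and the
    column order follows train_features.
    """
    test_index = {name: j for j, name in enumerate(test_features)}
    aligned = np.zeros((test_X.shape[0], len(train_features)), dtype=test_X.dtype)
    for i, name in enumerate(train_features):
        j = test_index.get(name)
        if j is not None:
            # Feature exists in both sets: copy it into the training position.
            aligned[:, i] = test_X[:, j]
    return aligned

# Toy data (assumed for illustration only).
train_features = ["cat", "dog", "fish"]
test_features = ["bird", "dog", "cat"]   # "bird" is unseen, "fish" is missing
test_X = np.array([[1, 2, 3],
                   [4, 5, 6]])

aligned = align_to_train(test_X, test_features, train_features)
```

After this step `aligned` has exactly as many columns as the training matrix, in the same order, so it can be passed to `predict`.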

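In particular, scikit-learn gives you plenty of data preprocessing utilities that do this for you, such as CountVectorizer: fit it on the training texts only, then use the same fitted object to transform the test texts, so that both matrices share one vocabulary and column order.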
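A minimal sketch of that approach, assuming you still have the raw texts; the example sentences and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the tweets.
train_texts = ["good movie", "bad movie", "great film", "awful film"]
train_y = [1, 0, 1, 0]
test_texts = ["good film tonight", "awful weather"]

vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts only.
train_X = vectorizer.fit_transform(train_texts)
# Reuse the same vocabulary: words unseen in training are ignored, and
# training words absent from a test text simply get a count of zero.
test_X = vectorizer.transform(test_texts)

clf = MultinomialNB()
clf.fit(train_X, train_y)
predict_Y = clf.predict(test_X)
```

The key point is calling `transform` (not `fit_transform`) on the test texts; fitting a second vectorizer on the test set is what produces a matrix with a different number of columns.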

Upvotes: 2
