covariance

Reputation: 7173

MultinomialNB: ValueError

In trying a Multinomial NB classifier on Kaggle's training/test sets, I get an odd ValueError. My (practice) goal is to just predict whether passengers are male or female based on their name, which goes into a CountVectorizer.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-97-ad69ee9ed02b> in <module>()
     12 
     13 # classifier prediction - for test set now!
---> 14 predict_Test = nb.predict(name_test)
     15 score = accuracy_score(sex_test, predictions, normalize=True)
     16 print score

C:\Users\Evan Chow\Anaconda\lib\site-packages\sklearn\naive_bayes.pyc in predict(self, X)
     61             Predicted target values for X
     62         """
---> 63         jll = self._joint_log_likelihood(X)
     64         return self.classes_[np.argmax(jll, axis=1)]
     65 

C:\Users\Evan Chow\Anaconda\lib\site-packages\sklearn\naive_bayes.pyc in _joint_log_likelihood(self, X)
    455         """Calculate the posterior log probability of the samples X"""
    456         X = atleast2d_or_csr(X)
--> 457         return (safe_sparse_dot(X, self.feature_log_prob_.T)
    458                 + self.class_log_prior_)
    459 

C:\Users\Evan Chow\Anaconda\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
     81         return ret
     82     else:
---> 83         return np.dot(a, b)
     84 
     85 

ValueError: matrices are not aligned

My code is:

rawDataTrain = pd.read_csv('train.csv')
trainData = pd.concat([rawDataTrain.Name, rawDataTrain.Sex], axis=1)

# get name, sex training data
cv_train = CountVectorizer(min_df = 0)
cv_train.fit(trainData.Name)
name_train = cv_train.transform(trainData.Name).toarray() # name train
sex_train = np.asarray(trainData.Sex, dtype='S') # sex train

# get name, sex testing data
rawDataTest = pd.read_csv('test.csv')
testData = pd.concat([rawDataTest.Name, rawDataTest.Sex], axis=1)
cv_test = CountVectorizer(min_df = 0)
cv_test.fit(testData.Name)
name_test = cv_test.transform(testData.Name).toarray() # name test
sex_test = np.asarray(testData.Sex, dtype='S') # sex test

# classifier prediction - test quickly on training set. you should get 1.0
predictionsTrain = nb.predict(name_train)
scoreTrain = accuracy_score(sex_train, predictionsTrain, normalize=True)
print scoreTrain # prints an accuracy of 1.0

# classifier prediction - this is what goes weird!
predict_Test = nb.predict(name_test)
score = accuracy_score(sex_test, predictions, normalize=True)
print score

Also, the dimensions of name_train, name_test, sex_train, and sex_test are:

(891, 1509) (418, 825) (891,) (418,)

It seems that the first dimension of name_train and name_test needs to be the same, but if that were true, prediction would only work on matrices with the same number of samples as the training set! Any thoughts on how to get rid of this ValueError?

Upvotes: 2

Views: 1472

Answers (1)

BrenBarn

Reputation: 251368

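The feature matrices need to have the same second dimension (i.e., the same number of columns), not the same first dimension. The traceback shows predict computing safe_sparse_dot(X, self.feature_log_prob_.T); feature_log_prob_ has one column per feature seen during training, so X must have exactly that many columns. Here name_train has 1509 columns and name_test has 825, which is why the matrices are not aligned. The number of rows (samples) is free to differ. The two arrays without a second dimension, sex_train and sex_test, are the labels, so this constraint does not apply to them.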
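One common way to get matching column counts is to fit a single CountVectorizer on the training names and reuse it to transform the test names, instead of fitting a second vectorizer on the test set. The following is a minimal sketch of that idea, not the asker's actual pipeline: the name lists are made-up stand-ins for the Name and Sex columns, and it is written for Python 3 and a recent scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the Name and Sex columns.
train_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
    "Futrelle, Mrs. Jacques Heath",
]
train_sex = ["male", "female", "male", "female"]
test_names = ["Kelly, Mr. James", "Wilkes, Mrs. James"]

# Fit the vocabulary on the training names only.
cv = CountVectorizer()
name_train = cv.fit_transform(train_names)

# Reuse the SAME fitted vectorizer for the test names, so both
# matrices share one vocabulary and one column count.
# Words not seen during training are simply ignored.
name_test = cv.transform(test_names)

nb = MultinomialNB()
nb.fit(name_train, train_sex)
predictions = nb.predict(name_test)

# The row counts differ, the column counts match.
print(name_train.shape, name_test.shape)
print(predictions)
```

With this arrangement the test set can contain any number of rows, since only the vocabulary (the columns) is fixed by the training data.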

Upvotes: 2
