Reputation: 7173
In trying a Multinomial NB classifier on Kaggle's training/test sets, I get an odd ValueError. My (practice) goal is to just predict whether passengers are male or female based on their name, which goes into a CountVectorizer.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-97-ad69ee9ed02b> in <module>()
12
13 # classifier prediction - for test set now!
---> 14 predict_Test = nb.predict(name_test)
15 score = accuracy_score(sex_test, predictions, normalize=True)
16 print score
C:\Users\Evan Chow\Anaconda\lib\site-packages\sklearn\naive_bayes.pyc in predict(self, X)
61 Predicted target values for X
62 """
---> 63 jll = self._joint_log_likelihood(X)
64 return self.classes_[np.argmax(jll, axis=1)]
65
C:\Users\Evan Chow\Anaconda\lib\site-packages\sklearn\naive_bayes.pyc in _joint_log_likelihood(self, X)
455 """Calculate the posterior log probability of the samples X"""
456 X = atleast2d_or_csr(X)
--> 457 return (safe_sparse_dot(X, self.feature_log_prob_.T)
458 + self.class_log_prior_)
459
C:\Users\Evan Chow\Anaconda\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
81 return ret
82 else:
---> 83 return np.dot(a, b)
84
85
ValueError: matrices are not aligned
My code is:
rawDataTrain = pd.read_csv('train.csv')
trainData = pd.concat([rawDataTrain.Name, rawDataTrain.Sex], axis=1)
# get name, sex training data
cv_train = CountVectorizer(min_df = 0)
cv_train.fit(trainData.Name)
name_train = cv_train.transform(trainData.Name).toarray() # name_train
sex_train = np.asarray(trainData.Sex, dtype='S') # sex_train
# get name, sex testing data
rawDataTest = pd.read_csv('test.csv')
testData = pd.concat([rawDataTest.Name, rawDataTest.Sex], axis=1)
cv_test = CountVectorizer(min_df = 0)
cv_test.fit(testData.Name)
name_test = cv_test.transform(testData.Name).toarray() # name test
sex_test = np.asarray(testData.Sex, dtype='S') # sex test
# classifier prediction - test quickly on training set. you should get 1.0
predictionsTrain = nb.predict(name_train)
scoreTrain = accuracy_score(sex_train, predictionsTrain, normalize=True)
print scoreTrain # prints an accuracy of 1.0
# classifier prediction - this is what goes weird!
predict_Test = nb.predict(name_test)
score = accuracy_score(sex_test, predictions, normalize=True)
print score
Also, the dimensions of name_train, name_test, sex_train, and sex_test are:
(891, 1509) (418, 825) (891,) (418,)
It seems that the first dimension of name_train and name_test needs to be the same, but if that were true, prediction would only work on matrices with the same number of samples as the training set! Any thoughts on how to get rid of this ValueError?
Upvotes: 2
Views: 1472
Reputation: 251368
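They need to have the same second dimension (i.e., the same number of columns), not the same first dimension. name_train has 1509 columns and name_test has 825, so a classifier that learned 1509 features cannot take the dot product with an 825-column matrix. The column counts differ because each CountVectorizer was fitted separately and built its own vocabulary. (sex_train and sex_test have no second dimension at all because they are label vectors, so only the two name matrices matter here.)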
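One way to get matching column counts, sketched here under assumptions, is to fit a single CountVectorizer on the training names and reuse it to transform the test names, so both matrices share one vocabulary. The name lists and labels below are hypothetical toy data standing in for the Name and Sex columns, and the variable `cv` is introduced only for illustration; this is not necessarily how the original notebook should be restructured.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the Name and Sex columns.
train_names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
]
train_sex = np.array(["male", "female", "female", "male"])
test_names = ["Kelly, Mr. James", "Wilkes, Mrs. James"]

# Fit the vocabulary once, on the training names only.
cv = CountVectorizer()
name_train = cv.fit_transform(train_names)

# Reuse the same fitted vectorizer for the test names: words that were
# not seen during fitting are ignored, so the column count stays the same.
name_test = cv.transform(test_names)

# Row counts differ (4 vs 2), column counts match.
print(name_train.shape, name_test.shape)

nb = MultinomialNB()
nb.fit(name_train, train_sex)
predictions = nb.predict(name_test)
print(predictions)
```

The number of rows can differ freely between the two matrices; only the number of columns (the vocabulary size) has to agree with what the classifier saw during fitting.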
Upvotes: 2