Reputation: 75
I have two files for e-mails some are spam and some are ham, I'm trying to train a classifier using Naive Bayes and then test it on a test set, I'm still trying to figure out how to do that
df = DataFrame()
train=data.sample(frac=0.8,random_state=20)
test=data.drop(train.index)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)
classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)
testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)
I don't think it's the right way to do that and in addition to that, the last line is giving me an error.
ValueError: dimension mismatch
Upvotes: 0
Views: 563
Reputation: 947
The idea behind CountVectorizer
is that it creates a function that maps word counts to identical places in an array. For example this: a b a c
might become [2, 1, 1]
. When you call fit_transform
it creates that index mapping A -> 0, B-> 1, C -> 2
and then applies that to create the vector of counts. Here you call fit_transform
to create a count vectorizer for your training and then again for your testing set. Some words may be in your testing data and not your training data and these get added. To expand on the earlier example example, your test set might be d a b
which would create a vector with dimension 4 to account for d
. This is likely why the dimensions don't match.
To fix this don't use fit transform the second time so replace:
vectorizer.fit_transform(test['message'].values)
with:
vectorizer.transform(test['message'].values)
It is important to make your vectorizier from your training data not all of your data, which is tempting to avoid missing features. This makes your tests more accurate since when really using the model it will encounter unknown words.
This is no guarantee your approach will work but this is likely the source of the dimensionality issue.
Upvotes: 1