RanX

Reputation: 75

Using Naive Bayes for spam detection

I have two files of e-mails; some are spam and some are ham. I'm trying to train a classifier using Naive Bayes and then evaluate it on a test set, but I'm still trying to figure out how to do that.

df = DataFrame()
train=data.sample(frac=0.8,random_state=20)
test=data.drop(train.index)
vectorizer = CountVectorizer()

counts = vectorizer.fit_transform(train['message'].values)
classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)

testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)

I don't think this is the right way to do it, and in addition, the last line gives me an error:

ValueError: dimension mismatch

Upvotes: 0

Views: 563

Answers (1)

morsecodist

Reputation: 947

The idea behind CountVectorizer is that it builds a mapping from each word to a fixed position in an array, then counts how often each word occurs. For example, a b a c might become [2, 1, 1]. When you call fit_transform, it creates the index mapping a -> 0, b -> 1, c -> 2 and then applies it to produce the vector of counts.

Here you call fit_transform on your training set and then again on your testing set. The second call discards the mapping learned from the training data and builds a new one from the test data alone, so the number of columns generally differs from what the classifier was trained on. To expand on the earlier example, your test set might be d a b e, which would produce a vector of dimension 4 (a, b, d, e) instead of 3. This is likely why the dimensions don't match.
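As a rough illustration of the mapping idea, here is a toy count vectorizer in plain Python. It is not scikit-learn's implementation; the helper names fit_vocabulary and transform and the sample strings are made up for this sketch, and it splits on whitespace only.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign each distinct word a column index (sorted for determinism)."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Count words per document; words missing from vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # vocab preserves insertion order, so columns follow the indices
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

train_docs = ["a b a c"]   # hypothetical training text
test_docs = ["d a b e"]    # hypothetical test text with unseen words

# "fit" on the training data, then "transform" it: 3 columns (a, b, c)
train_vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, train_vocab)

# Re-fitting on the test data builds a different mapping: 4 columns (a, b, d, e)
refit_vocab = fit_vocabulary(test_docs)
refit_counts = transform(test_docs, refit_vocab)

# Reusing the training mapping keeps 3 columns; d and e are simply dropped
reused_counts = transform(test_docs, train_vocab)

print(len(train_counts[0]), len(refit_counts[0]), len(reused_counts[0]))
```

A classifier trained on the 3-column rows cannot score the 4-column row, which is the same kind of mismatch the error message reports.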

To fix this, don't call fit_transform the second time. Replace:

vectorizer.fit_transform(test['message'].values)

with:

vectorizer.transform(test['message'].values)

It is important to build your vectorizer from your training data only, not from all of your data, even though using everything is tempting as a way to avoid missing features. Fitting on the training data alone makes your test results more realistic, since in real use the model will encounter unknown words.
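Putting it together, a minimal sketch of the corrected flow might look like the following. The messages and labels are invented for illustration, and plain lists stand in for the train['message'] and train['class'] columns from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the question's training and test splits
train_messages = [
    "win cash prize now",
    "free prize click now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_messages = ["free cash prize", "team meeting tomorrow"]

# Learn the vocabulary from the training messages only
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_messages)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# transform (not fit_transform) reuses the training vocabulary,
# so the test matrix has the same number of columns
test_counts = vectorizer.transform(test_messages)
predictions = classifier.predict(test_counts)

print(train_counts.shape, test_counts.shape)
print(predictions)
```

Words that appear only in the test messages are ignored by transform, which mirrors what happens when the model sees new mail in practice.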

This is no guarantee that your approach will work, but it is likely the source of the dimension mismatch.

Upvotes: 1
