Jakob Nielsen

Reputation: 5198

Using Naive Bayes for detecting spam

I am implementing a Naive Bayes spam detector whose features are words, and I am not sure if I understand the algorithm correctly yet.

This is how I am trying to implement the algorithm:

In the training set I count, for every word, in how many spam texts and in how many non-spam texts it is present. I also store the total number of spam and non-spam texts checked during training.
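As a rough sketch of that counting step (a Python sketch under the assumption that each text is already a list of words; the function and variable names are hypothetical):

    from collections import Counter

    def train(spam_texts, nonspam_texts):
        # For every word, count in how many spam / non-spam texts it appears
        spam_counts = Counter()
        nonspam_counts = Counter()
        for text in spam_texts:
            spam_counts.update(set(text))       # set(): count each word at most once per text
        for text in nonspam_texts:
            nonspam_counts.update(set(text))
        # also remember the total number of spam and non-spam texts seen in training
        return spam_counts, nonspam_counts, len(spam_texts), len(nonspam_texts)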

Now after the training is done assume I have a new text T which I want to classify.

I first assume the prior probabilities for spam (S) and non-spam (N) to be:

P(S) = 0.5 
P(N) = 0.5

Now I check every word W which is contained in T.

Assume a word W was present in 20 of the spam texts and in 2 of the non-spam texts. The total number of spam texts checked is 50 and the total number of non-spam texts checked is 50 as well, so I have the conditional probabilities (likelihoods):

P(W|S) = 20 / 50
P(W|N) = 2 / 50

Dropping the common normalizer P(W), the posterior probabilities would then be proportional to

P(S|W) ∝ P(W|S) * P(S) = 0.4 * 0.5 = 0.2
P(N|W) ∝ P(W|N) * P(N) = 0.04 * 0.5 = 0.02

From this the algorithm would classify the text as spam.
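Put together, a minimal sketch of this classification step (still without any smoothing, and building on the hypothetical counters from the training sketch above) might look like this; note that it reproduces exactly the zero-probability problem described below:

    def classify(text, spam_counts, nonspam_counts, n_spam, n_nonspam):
        p_spam, p_nonspam = 0.5, 0.5                         # priors P(S) and P(N)
        for word in set(text):
            p_spam *= spam_counts[word] / n_spam             # P(W|S)
            p_nonspam *= nonspam_counts[word] / n_nonspam    # P(W|N)
        # classify as spam if the (unnormalized) posterior for spam is larger
        return "spam" if p_spam > p_nonspam else "non-spam"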

What I am having trouble understanding is the following case:

Assume we have a word W which was present in 0 of the spam texts, but in 1 of the non-spam texts. In this case the conditional probability of W given spam would be

P(W|S) = 0 / 50 = 0

and therefore the whole posterior probability for spam would be 0 as well.
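For example, even if every other word in T pointed strongly towards spam, the product would still collapse to zero:

P(S|T) ∝ P(W1|S) * P(W2|S) * ... * P(W|S) * P(S) = P(W1|S) * P(W2|S) * ... * 0 * 0.5 = 0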

So this would mean that whenever a text contains a word which was never found in any spam text but was found in a non-spam text, the algorithm would classify the text as non-spam, regardless of any other words and of the prior probability.

This is what confuses me and makes me think I haven't understood the algorithm correctly yet.

Upvotes: 0

Views: 128

Answers (1)

John Bennedict Lorenzo

Reputation: 527

You have to implement additive smoothing to take into account words that were never seen in a class (or that are missing from the dictionary entirely).

Additive smoothing makes the probability of such a word strictly positive, i.e. P(W|S) > 0 even for a word that never occurred in any spam text.

This is the modified formula for the likelihood:

P(word|class) = ( Σ_D I(word, D) + lambda ) / ( |D_class| + lambda * |V| )

where:

the sum Σ_D runs over every document D belonging to the class

I(word, D) is the indicator function which returns 1 if document D contains the word and 0 otherwise

lambda is a chosen smoothing constant (lambda = 1 gives Laplace smoothing)

|D_class| is the number of documents in the class

|V| is the number of words in the dictionary or vocabulary
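As a rough sketch of this smoothed likelihood (in Python, reusing the per-class document counts from the question; the names are hypothetical, and lam stands in for lambda, which is a reserved word in Python):

    def smoothed_likelihood(word, counts, n_docs_in_class, vocab_size, lam=1.0):
        # (documents in the class containing the word + lambda)
        # divided by (documents in the class + lambda * |V|)
        n_word = counts.get(word, 0)    # 0 if the word was never seen in this class
        return (n_word + lam) / (n_docs_in_class + lam * vocab_size)

With this, a word that never occurred in any spam text gets the small but non-zero likelihood lam / (n_docs_in_class + lam * vocab_size) instead of 0, so a single unseen word can no longer force the whole posterior to 0.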

Upvotes: 1
