Reputation: 5198
I am implementing a naive Bayes spam detector whose features are words, and I am not sure I understand the algorithm correctly yet.
This is how I am trying to implement the algorithm:
In the training set I count how often a specific word is present in spam texts and how often it is present in non-spam texts. I also store the total number of spam and non-spam texts checked during training.
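In code, that counting step might look something like this (a minimal sketch; the function and variable names are mine, and texts are assumed to be plain whitespace-separated strings):

```python
from collections import Counter

def train(spam_texts, ham_texts):
    """Count, per class, in how many texts each word appears."""
    spam_counts = Counter()
    ham_counts = Counter()
    for text in spam_texts:
        # set() so each word is counted at most once per text
        spam_counts.update(set(text.lower().split()))
    for text in ham_texts:
        ham_counts.update(set(text.lower().split()))
    return spam_counts, ham_counts, len(spam_texts), len(ham_texts)
```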
Now, after the training is done, assume I have a new text T which I want to classify.
I first assume the prior probabilities for spam (S) and non-spam (N):
P(S) = 0.5
P(N) = 0.5
Now I check every word W contained in T. Assume a word W was present in 20 spam texts and in 2 non-spam texts. The total number of spam texts checked is 50 and the total number of non-spam texts checked is 50 as well, so I have the likelihoods:
P(W|S) = 20 / 50
P(W|N) = 2 / 50
The posterior probabilities would then be proportional to
P(S|W) ∝ P(W|S) * P(S) = 0.2
P(N|W) ∝ P(W|N) * P(N) = 0.02
From this the algorithm would classify the text as spam.
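Putting the whole scoring step together, something like the following sketch is what I have in mind (names are mine; the counts are per-class document frequencies as described above):

```python
def classify(text, spam_counts, ham_counts, n_spam, n_ham):
    """Score a text by multiplying per-word likelihoods with the class prior."""
    score_spam = 0.5  # P(S)
    score_ham = 0.5   # P(N)
    for word in set(text.lower().split()):
        score_spam *= spam_counts.get(word, 0) / n_spam
        score_ham *= ham_counts.get(word, 0) / n_ham
    return "spam" if score_spam > score_ham else "non-spam"
```

With the numbers above (a word seen in 20 of 50 spam texts and 2 of 50 non-spam texts), this yields 0.2 versus 0.02 and classifies the text as spam.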
What I'm having trouble understanding is the following case:
Assume we have a word W which was present in 0 spam texts, but in 1 non-spam text. In this case the likelihood for spam would be
P(W|S) = 0 / 50 = 0
and therefore the whole product would be 0 as well.
So this would mean that whenever a text contains a word that was never found in a spam text but was found in a non-spam text, the algorithm would classify the text as non-spam, regardless of any other words and the prior probability.
This is what confuses me and makes me think I haven't understood the algorithm correctly yet.
Upvotes: 0
Views: 128
Reputation: 527
You have to implement additive smoothing to take unseen words into account.
Additive smoothing makes the probability of a word that does not appear in a class's training texts P(W|S) > 0.
This is the modified formula for the likelihood:
P(word|class) = ( sum_D I(word, D) + lambda ) / ( |D_class| + lambda * |V| )
where the sum runs over all documents D belonging to the class, and:
I(word, D) is the indicator function, which returns 1 if document D contains the word and 0 otherwise.
lambda is a chosen smoothing constant.
|D_class| is the number of documents in the class.
|V| is the number of words in the dictionary or vocabulary.
Upvotes: 1