Reputation: 3199
EDIT: A correct, working version of the code can be found at: https://github.com/a7x/NaiveBayes-Classifier
I used data from openClassroom and started working on a small version of Naive Bayes in Python. The steps were the usual ones: training and then prediction. I have a few questions, and I want to know why the accuracy is quite bad.
For training, I calculated the log likelihood by the formula:
log( (count(word in spam) + 1) / (spamSize + vocabSize) )
My question is: why did we add the vocabSize in this case, and is this the correct way of going about it? The code used is below:
#This is for training. Calculate all probabilities and store them in a vector. Better to store it in a file for easier access
from __future__ import division
import sys,os
'''
1. The spam and non-spam classes are split 50/50, so the priors are 0.5 by default
2. Now we need to calculate the probability of each word, in spam and non-spam separately
2.1 We can make two dictionaries, defaultdicts basically, for spam and non-spam
2.2 When the time comes to calculate probabilities, we just need to substitute values
'''
from collections import *
from math import *
spamDict = defaultdict(int)
nonspamDict = defaultdict(int)
spamFolders = ["spam-train"]
nonspamFolders = ["nonspam-train"]
path = sys.argv[1] #Base path
spamVector = open(sys.argv[2],'w') #Write all spam values into this
nonspamVector = open(sys.argv[3],'w') #Write all non-spam values into this
#Go through all files in spam and iteratively add values
spamSize = 0
nonspamSize = 0
vocabSize = 264821
for f in os.listdir(os.path.join(path,spamFolders[0])):
    data = open(os.path.join(path,spamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        spamSize = spamSize + len(words)
        for w in words:
            spamDict[w] += 1
for f in os.listdir(os.path.join(path,nonspamFolders[0])):
    data = open(os.path.join(path,nonspamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        nonspamSize = nonspamSize + len(words)
        for w in words:
            nonspamDict[w] += 1
logProbspam = {}
logProbnonSpam = {} #This is to store the log probabilities
for k in spamDict.keys():
    #Need to calculate P(x | y = 1)
    numerator = spamDict[k] + 1 # Frequency
    print 'Word',k,' frequency',spamDict[k]
    denominator = spamSize + vocabSize
    p = log(numerator/denominator)
    logProbspam[k] = p
for k in nonspamDict.keys():
    numerator = nonspamDict[k] + 1 #frequency
    denominator = nonspamSize + vocabSize
    p = log(numerator/denominator)
    logProbnonSpam[k] = p
for k in logProbnonSpam.keys():
    nonspamVector.write(k+" "+str(logProbnonSpam[k])+"\n")
for k in logProbspam.keys():
    spamVector.write(k+" "+str(logProbspam[k])+"\n")
For prediction, I just took a mail, split it into words, added up all the probabilities separately for spam/non-spam, and multiplied each total by 0.5. Whichever was higher was the class label. The code is below:
http://pastebin.com/8Y6Gm2my ( Stackoverflow was again playing games for some reason :-/)
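(Since the pastebin isn't inlined, here is a minimal sketch of the prediction step as described above. The helper load_vector and the variable names are assumptions for illustration, not the exact pastebin code; the totalSpamP line is the one the answers below correct.)

def load_vector(fname):
    # Read "word logprob" pairs written by the training script above.
    probs = {}
    for line in open(fname):
        parts = line.rsplit(" ", 1)
        if len(parts) == 2:
            probs[parts[0]] = float(parts[1])
    return probs

spamProbs = load_vector(sys.argv[1])     # spam log-probabilities
nonspamProbs = load_vector(sys.argv[2])  # non-spam log-probabilities

spamP, nonspamP = 0.0, 0.0
for line in open(sys.argv[3]):           # the mail to classify
    for w in line.split(" "):
        if w in spamProbs:
            spamP = spamP + spamProbs[w]
        if w in nonspamProbs:
            nonspamP = nonspamP + nonspamProbs[w]

totalSpamP = spamP * 0.5                 # as described above; the answers below correct this step
totalNonspamP = nonspamP * 0.5
print "spam" if totalSpamP > totalNonspamP else "non-spam"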
EDIT: I have removed the spam = spam + 1 thing. Instead, I just ignore those words.
Problem: My accuracy is quite bad, as shown below.
No of files in spam is 130
No. of spam in ../NaiveBayes/spam-test is 53 no. of non-spam 77
No of files in non-spam is 130
No. of spam in ../NaiveBayes/nonspam-test/ is 6 no. of non-spam 124
Please tell me where I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.
Upvotes: 3
Views: 2700
Reputation: 1206
There are multiple errors and bad assumptions in your program - in both parts of it. Here are several.
The default log-probability for a word that was never seen in a group during training is not 1 (which is what you add), but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)), depending on the group. This is actually very important - you need to store this probability with your data for the program to function correctly.
Update: You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
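On storing that default probability, one possible sketch against the training script above (the __unseen__ key is just an assumed name for illustration):

from math import log

# Store the smoothed log-probability of a word never seen in each class,
# alongside the per-word values the training script already writes out.
defaultSpam = log(1.0 / (spamSize + vocabSize))
defaultNonspam = log(1.0 / (nonspamSize + vocabSize))
spamVector.write("__unseen__ " + str(defaultSpam) + "\n")
nonspamVector.write("__unseen__ " + str(defaultNonspam) + "\n")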
Your modification did not correctly fix either 4 or 5. First, if a word has never been observed in one of the sets, the probability of the message belonging to that set should decrease; ignoring the word is not a good idea, you need to account for it as a highly improbable one. Second, your current code is asymmetric: a word being absent from the spam model cancels the check against the non-spam model (but not the other way around). If you need to do nothing in an exception handler, use pass, not continue, since the latter immediately jumps to the next for w in words: iteration.
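In other words, the scoring loop should treat both classes the same way and fall back to the stored default for unseen words - something along these lines (a sketch only; words is the token list of the mail, spamProbs/nonspamProbs are the dictionaries loaded from the model files, and __unseen__ is the stored default from above):

def score(words, probs):
    # Sum log-probabilities, falling back to the stored "unseen word"
    # default instead of skipping the word or aborting the other class.
    total = 0.0
    for w in words:
        total += probs.get(w, probs["__unseen__"])
    return total

spamP = score(words, spamProbs)
nonspamP = score(words, nonspamProbs)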
Problem 2 is also still in place: the vocabulary size you use does not match the real one. It must be the number of different words observed in the training set, not the total number of words in all the messages put together.
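For instance, with the two dictionaries built by your training script, the vocabulary size would simply be (a sketch, assuming the word counts have already been collected):

# Vocabulary = number of distinct words seen anywhere in the training set,
# not the hard-coded 264821 and not the total word count.
vocabSize = len(set(spamDict.keys()) | set(nonspamDict.keys()))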
Upvotes: 2
Reputation: 363487
Here's at least one of the errors you're making: you're storing log-probabilities in your model file (as you should), but then in the prediction code you're pretending that they are straight probabilities:
totalSpamP = spamP * 0.5
should be
totalSpamP = spamP + math.log(0.5)
Also, I don't get what this line is doing:
spamP = spamP + 1
It seems to be making up for a feature not found in the spam part of the training set, but those words should simply be ignored. Right now, it adds 1 to a log-probability, i.e. it multiplies the underlying probability by e (exp(1)), which by definition is invalid.
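Putting those two corrections together, the decision step might look roughly like this (a sketch under this answer's suggestion of simply skipping unknown words; spamP/nonspamP are assumed to be the summed word log-probabilities):

import math

# Work entirely in log space: add the log of the class prior (0.5 here)
# instead of multiplying the summed log-probabilities by 0.5.
totalSpamP = spamP + math.log(0.5)
totalNonspamP = nonspamP + math.log(0.5)
label = "spam" if totalSpamP > totalNonspamP else "non-spam"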
(As an aside, I just tried classification on this training set with my own implementation of Naive Bayes and got 97.6% accuracy, so that's the figure you should be aiming at :)
Upvotes: 2