crazyaboutliv

Reputation: 3199

Implementation of Naive Bayes - accuracy issues

EDIT: The correct, working version of the code can be found at: https://github.com/a7x/NaiveBayes-Classifier

I used data from OpenClassroom and started working on a small version of Naive Bayes in Python. The steps were the usual ones: training and then prediction. I have a few questions, and I want to know why the accuracy is so bad.

  1. For training, I calculated the log-likelihood with the formula:

    log( (count(word in spam) + 1) / (spamSize + vocabSize) )

    My question is: why do we add vocabSize to the denominator here, and is this the correct way of going about it? The code I used is below:

    #This is for training. Calculate all probabilities and store them in a vector; better to store them in a file for easier access
    from __future__ import division
    import sys,os
    ''' 
    1. The spam/non-spam split is already 50/50, so the priors default to 0.5
    2. Now we need to calculate the probability of each word, in spam and non-spam separately
      2.1 We can make two dictionaries (defaultdicts, basically) for spam and non-spam
      2.2 When the time comes to calculate probabilities, we just substitute the values
    '''
    from collections import defaultdict
    from math import log
    
    spamDict = defaultdict(int)
    nonspamDict = defaultdict(int)
    spamFolders = ["spam-train"]
    nonspamFolders = ["nonspam-train"]
    path = sys.argv[1] #Base path
    spamVector = open(sys.argv[2],'w') #Write all spam log-probabilities into this file
    nonspamVector = open(sys.argv[3],'w') #Non-spam log-probabilities
    
    #Go through all files in spam and iteratively add word counts
    spamSize = 0
    nonspamSize = 0
    vocabSize = 264821
    for f in os.listdir(os.path.join(path,spamFolders[0])):
        data = open(os.path.join(path,spamFolders[0],f),'r')
    
        for line in data:
            words = line.split(" ")
            spamSize = spamSize + len(words)
            for w in words:
                spamDict[w]+=1
    
    for f in os.listdir(os.path.join(path,nonspamFolders[0])):
        data = open(os.path.join(path,nonspamFolders[0],f),'r')
        for line in data:
            words = line.split(" ")
            nonspamSize = nonspamSize + len(words)
            for w in words:
                nonspamDict[w]+=1
    #These dictionaries store the log probabilities
    logProbspam = {}
    logProbnonSpam = {}
    for k in spamDict.keys():
        #Need to calculate P(x | y = 1)
    
        numerator =  spamDict[k] + 1  # Frequency
        print 'Word',k,' frequency',spamDict[k]
        denominator = spamSize + vocabSize
        p = log(numerator/denominator)
        logProbspam[k] = p
    for k in nonspamDict.keys():
        numerator = nonspamDict[k] + 1 #frequency
        denominator = nonspamSize + vocabSize
        p = log(numerator/denominator)
        logProbnonSpam[k] = p
    
    for k in logProbnonSpam.keys():
        nonspamVector.write(k+" "+str(logProbnonSpam[k])+"\n")
    for k in logProbspam.keys():
        spamVector.write(k+" "+str(logProbspam[k])+"\n")
    
  2. For prediction, I just took an email, split it into words, added up all the probabilities separately for spam and non-spam, and multiplied each total by 0.5. Whichever was higher became the class label. The code is below:

    http://pastebin.com/8Y6Gm2my (Stack Overflow was again playing games for some reason :-/)

EDIT: I have removed the spam = spam + 1 thing. Instead, I just ignore those words.

Problem: My accuracy is quite bad, as shown below.

    No of files in spam is 130
    No. of spam in  ../NaiveBayes/spam-test  is  53  no. of non-spam 77
    No of files in non-spam is 130
    No. of spam in  ../NaiveBayes/nonspam-test/  is  6  no. of non-spam 124

Please tell me where I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.

Upvotes: 3

Views: 2700

Answers (2)

Tanriol

Reputation: 1206

There are multiple errors and bad assumptions in your program, in both parts of it. Here are several.

  1. You hardcode into your program the fact that you have the same number of spam and non-spam emails. I'd recommend not hardcoding this assumption. It is not absolutely essential here, but in the more general case you will need to remove it.
  2. You hardcode into your program a number that you treat as the vocabulary size. I'd recommend against this, as that number may change with any modification of the training set; moreover, it is actually incorrect. I'd recommend calculating it during training (see the sketch after this list).
  3. This may not be a mistake, but you seem to build a vocabulary of all the words in the training set. That may be suboptimal; the page you refer to actually recommends taking into account only the top 2500 words over all the emails. However, this is not essential for correct results: even without that filtering, my implementation gets only a few emails wrong.
  4. You incorrectly account for words that have been observed in only spam or only non-spam. The log-probability of finding such a word in the other subset is not the 1 that you add, but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)), depending on the group. This is actually very important: you need to store this probability with your data for the program to work correctly.
  5. You do not ignore words that were never observed in the training set. These can actually be treated in different ways, but you do need to take them into account.
  6. Due to incorrect indentation in the prediction function, you predict using only the first line of the message rather than the whole message. Just a programming bug.
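
To illustrate point 2, here is a rough sketch of computing the vocabulary size from the training data itself instead of hardcoding it. The folder names follow your script; the function name is just illustrative:

    import os

    def count_vocabulary(path, folders=("spam-train", "nonspam-train")):
        # Number of distinct words seen anywhere in the training set
        vocab = set()
        for folder in folders:
            for fname in os.listdir(os.path.join(path, folder)):
                with open(os.path.join(path, folder, fname)) as data:
                    for line in data:
                        vocab.update(line.split())
        return len(vocab)

    # vocabSize = count_vocabulary(path)  # instead of the hardcoded 264821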

Update: You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
Your modification did not correctly fix either 4 or 5. First, if a word has never been observed in some set, the probability of the message belonging to that set should decrease. Ignoring the word is not a good idea; you need to account for it as a highly improbable one.
Second, your current code is asymmetric: a word being absent from spam cancels the check for non-spam (but not the other way around). If you need to do nothing in an exception handler, use pass, not continue, as the latter immediately jumps to the next for w in words: iteration.
Problem 2 is also still in place: the vocabulary size you use does not match the real one. It must be the number of distinct words observed in the training set, not the total number of words in all the messages combined.
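
To illustrate points 4 and 5, here is a sketch of a symmetric per-word lookup that falls back to the smoothed "unseen word" log-probability instead of adding 1 or skipping one class. The names are illustrative, not taken from your code; dropping words absent from both training sets entirely would be another valid choice:

    from math import log

    def word_logprob(word, counts, class_size, vocab_size):
        # Laplace-smoothed log P(word | class); a word unseen in this class
        # contributes log(1 / (class_size + vocab_size)) rather than 1.
        return log((counts.get(word, 0) + 1.0) / (class_size + vocab_size))

    def class_score(words, counts, class_size, vocab_size, log_prior):
        # Sum of per-word log-probabilities plus the log-prior for one class
        return log_prior + sum(word_logprob(w, counts, class_size, vocab_size)
                               for w in words)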

Upvotes: 2

Fred Foo

Reputation: 363487

Here's at least one of the errors you're making: you're storing log-probabilities in your model file (as you should), but then in the prediction code you're pretending that they are straight probabilities:

    totalSpamP = spamP * 0.5

should be

    totalSpamP = spamP + math.log(0.5)

Also, I don't get what this line is doing:

    spamP = spamP + 1

It seems to be making up for a feature not found in the spam part of the training set, but those words should simply be ignored. Right now it adds 1 to a log-probability, i.e. multiplies the underlying probability by e (exp(1)), which by definition gives an invalid probability.
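
Put together, the whole decision in log space is just two sums compared against each other. Here is a minimal sketch, assuming you have loaded your saved model back into two dictionaries named like the ones in your training script, and simply skipping unknown words as suggested above:

    import math

    def classify(words, logProbspam, logProbnonSpam):
        # Add log-probabilities (never multiply them) and the log of the 0.5 prior
        spam_score = math.log(0.5) + sum(logProbspam[w] for w in words
                                         if w in logProbspam)
        nonspam_score = math.log(0.5) + sum(logProbnonSpam[w] for w in words
                                            if w in logProbnonSpam)
        return 'spam' if spam_score > nonspam_score else 'nonspam'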

(As an aside, I just tried classification on this training set with my own implementation of Naive Bayes and got 97.6% accuracy, so that's the figure you should be aiming at :)

Upvotes: 2
