celsowm

Reputation: 404

Difficulty getting the correct posterior value in a Naive Bayes implementation

For studying purposes, I've tried to implement this "lesson" using Python, but without scikit-learn or anything similar.

My attempt is the following code:

import pandas, math

training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'
data_frame = pandas.DataFrame(training_data, columns=['data','label'])
data_frame = data_frame.applymap(lambda s:s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()
labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()
unique_words_set = set()
unique_words = data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)

word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.items():
        word_frequency_per_labels.append([w,f,l])

word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word','frequency','label'])
laplace_smoothing = 1
results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l,math.prod(p)])

print(results)
result = pandas.DataFrame(results, columns=['labels','posterior']).sort_values('posterior',ascending = False).labels.iloc[0]
print(result)

In the blog lesson, their results are:

[screenshot of the expected posterior values from the blog post]

But my results were:

[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]

So, what did I do wrong in my python implementation? How can I get the same results?

Thanks in advance

Upvotes: 2

Views: 120

Answers (2)

pietroppeter

Reputation: 1473

The answer by @nick is correct and should be awarded the bounty.

Here is an alternative implementation (from scratch, not using pandas) that also supports normalization of the probabilities and words that are not in the training set:

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set

def tokenize(text: str):
    return [word.lower() for word in text.split()]

def normalize(result: Dict[str, float]):
    total = sum([v for v in result.values()])
    for k in result.keys():
        result[k] /= total

@dataclass
class Model:
    labels: Set[str] = field(default_factory=set)
    words: Set[str] = field(default_factory=set)
    prob_labels: Dict[str,float] = field(default_factory=lambda: defaultdict(float)) # P(label)
    prob_words: Dict[str,Dict[str,float]] = field(default_factory=lambda: defaultdict(lambda: defaultdict(float)))  # P(word | label) as prob_words[label][word]

    
    def predict(self, text: str, norm=True) -> Dict[str, float]: # P(label | text) as model.predict(text)[label]
        # start from the prior P(label) and multiply by P(word | label) for every known word
        result = {label: self.prob_labels[label] for label in self.labels}
        for word in tokenize(text):
            for label in self.labels:
                if word in self.words:  # words never seen in training are simply skipped
                    result[label] *= self.prob_words[label][word]
        if norm:
            normalize(result)
        return result

    def train(self, data):
        # count how many texts carry each label and how often each word appears per label
        prob_words_denominator = defaultdict(int)
        for row in data:
            text = row[0]
            label = row[1].lower()
            self.labels.add(label)
            self.prob_labels[label] += 1.0
            for word in tokenize(text):
                self.words.add(word)
                self.prob_words[label][word] += 1.0
                prob_words_denominator[label] += 1.0
        # turn the counts into probabilities, applying Laplace smoothing to the word counts
        for label in self.labels:
            self.prob_labels[label] /= len(data)
            for word in self.words:
                self.prob_words[label][word] = (self.prob_words[label][word] + 1.0) / (prob_words_denominator[label] + len(self.words))
            
            
training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'

model = Model()
model.train(training_data)
print(model.predict(text_to_predict, norm=False))
print(model.predict(text_to_predict))
print(model.predict("none of these words is in training data"))

output:

{'sports': 2.7647999999999997e-05, 'not sports': 5.7175324559303314e-06}
{'sports': 0.8286395560004286, 'not sports': 0.1713604439995714}
{'sports': 0.6, 'not sports': 0.4}
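
The first line is the unnormalized posterior (prior times the smoothed word likelihoods). The second line normalizes those values so they sum to 1, e.g. 2.7648e-05 / (2.7648e-05 + 5.7175e-06) ≈ 0.829 for sports. The last line shows that when none of the words were seen in training, the prediction falls back to the priors, 3/5 vs 2/5.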

Upvotes: 2

nick

Reputation: 1350

You haven't multiplied by the priors p(Sports) = 3/5 and p(Not sports) = 2/5. Scaling your answers by these priors will get you to the correct result. Everything else looks good.

For example, your math.prod(p) calculation computes p(a|Sports) x p(very|Sports) x p(close|Sports) x p(game|Sports), but this leaves out the term p(Sports). Adding it in (and doing the same for the "Not sports" case) fixes things.

In code this can be achieved by:

prior = (data_frame.label == l).mean()  # fraction of training rows with label l, i.e. p(l)
results.append([l, prior * math.prod(p)])
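
In context, the question's final loop with the prior included would look roughly like this (a sketch reusing the variable names from the question's code):

results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    prior = (data_frame.label == l).mean()  # p(l): 3/5 for sports, 2/5 for not sports
    results.append([l, prior * math.prod(p)])  # posterior is proportional to p(l) * product of p(w|l)

That scales the question's numbers to 0.6 x 4.608e-05 ≈ 2.765e-05 for sports and 0.4 x 1.429e-05 ≈ 5.72e-06 for not sports, matching the unnormalized output in the other answer.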

Upvotes: 2
