Reputation: 404
For study purposes, I've tried to implement this "lesson" (a Naive Bayes text classifier) in Python, but without scikit-learn or anything similar.
My attempt is the following code:
import math
import pandas

training_data = [
    ['A great game', 'Sports'],
    ['The election was over', 'Not sports'],
    ['Very clean match', 'Sports'],
    ['A clean but forgettable game', 'Sports'],
    ['It was a close election', 'Not sports']
]
text_to_predict = 'A very close game'

data_frame = pandas.DataFrame(training_data, columns=['data', 'label'])

# Lowercase the training texts and the text to classify
data_frame = data_frame.applymap(lambda s: s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()

labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()

# Vocabulary: the set of unique words across all training texts
unique_words_set = set()
data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)

# Per-label word counts
word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.items():  # .iteritems() was removed in pandas 2.0
        word_frequency_per_labels.append([w, f, l])
word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word', 'frequency', 'label'])

laplace_smoothing = 1
results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        # Count of word w under label l, or 0 if it never occurs there
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l, math.prod(p)])
print(results)

result = pandas.DataFrame(results, columns=['labels', 'posterior']).sort_values('posterior', ascending=False).labels.iloc[0]
print(result)
In the blog lesson their results are 2.76 × 10⁻⁵ for Sports and 0.572 × 10⁻⁵ for Not sports (so the text is classified as Sports).
But my results were:
[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]
So, what did I do wrong in my Python implementation? How can I get the same results?
Thanks in advance.
Upvotes: 2
Views: 120
Reputation: 1473
The answer by @nick is correct and should be awarded the bounty.
Here is an alternative implementation (from scratch, not using pandas) that also supports normalization of the probabilities and words that are not in the training set:
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set

def tokenize(text: str):
    return [word.lower() for word in text.split()]

def normalize(result: Dict[str, float]):
    total = sum(result.values())
    for k in result.keys():
        result[k] /= total

@dataclass
class Model:
    labels: Set[str] = field(default_factory=set)
    words: Set[str] = field(default_factory=set)
    prob_labels: Dict[str, float] = field(default_factory=lambda: defaultdict(float))  # P(label)
    prob_words: Dict[str, Dict[str, float]] = field(default_factory=lambda: defaultdict(lambda: defaultdict(float)))  # P(word | label) as prob_words[label][word]

    def predict(self, text: str, norm=True) -> Dict[str, float]:  # P(label | text) as model.predict(text)[label]
        result = {label: self.prob_labels[label] for label in self.labels}
        for word in tokenize(text):
            for label in self.labels:
                if word in self.words:  # words not seen in training are simply skipped
                    result[label] *= self.prob_words[label][word]
        if norm:
            normalize(result)
        return result

    def train(self, data):
        prob_words_denominator = defaultdict(int)
        for row in data:
            text = row[0]
            label = row[1].lower()
            self.labels.add(label)
            self.prob_labels[label] += 1.0
            for word in tokenize(text):
                self.words.add(word)
                self.prob_words[label][word] += 1.0
                prob_words_denominator[label] += 1.0
        for label in self.labels:
            self.prob_labels[label] /= len(data)
            for word in self.words:
                # Laplace smoothing
                self.prob_words[label][word] = (self.prob_words[label][word] + 1.0) / (prob_words_denominator[label] + len(self.words))

training_data = [
    ['A great game', 'Sports'],
    ['The election was over', 'Not sports'],
    ['Very clean match', 'Sports'],
    ['A clean but forgettable game', 'Sports'],
    ['It was a close election', 'Not sports']
]

text_to_predict = 'A very close game'
model = Model()
model.train(training_data)
print(model.predict(text_to_predict, norm=False))
print(model.predict(text_to_predict))
print(model.predict("none of these words is in training data"))
Output:
{'sports': 2.7647999999999997e-05, 'not sports': 5.7175324559303314e-06}
{'sports': 0.8286395560004286, 'not sports': 0.1713604439995714}
{'sports': 0.6, 'not sports': 0.4}
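For reference, the normalized output is just each raw score divided by the sum of the scores, and a text containing no known words falls back to the priors alone (0.6 and 0.4 here). A quick check of the normalization, using the raw scores printed above:
# Raw scores from model.predict(text_to_predict, norm=False)
joint = {'sports': 2.7647999999999997e-05, 'not sports': 5.7175324559303314e-06}
total = sum(joint.values())
print({label: score / total for label, score in joint.items()})
# {'sports': 0.8286395560004286, 'not sports': 0.1713604439995714}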
Upvotes: 2
Reputation: 1350
You haven't multiplied by the priors p(Sports) = 3/5 and p(Not sports) = 2/5, so just scaling your answers by these ratios will get you to the correct result. Everything else looks good.
For example, your math.prod(p) calculation implements p(a|Sports) × p(very|Sports) × p(close|Sports) × p(game|Sports), but this ignores the prior term p(Sports): Naive Bayes scores each label with p(label) × Π p(word|label). Adding the prior in (and doing the same for the Not sports label) fixes things.
In code this can be achieved by:
prior = (data_frame.label == l).mean()  # fraction of training rows with label l
results.append([l, prior * math.prod(p)])
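With that change the question's numbers line up with the blog. As a quick sanity check, multiplying the likelihood products from the question by the priors:
# Likelihood-only products from the question, scaled by the class priors
print(3/5 * 4.607999999999999e-05)   # ≈ 2.7648e-05 for sports
print(2/5 * 1.4293831139825827e-05)  # ≈ 5.7175e-06 for not sports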
Upvotes: 2