Aditya Agarwal

Reputation: 513

Feedback in NaiveBayes Text Classification

I am a newbie in machine learning. I am building a complaint categorizer, and I want to provide a feedback model so that it can improve over time.

import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
value=[
'drought',
'robber',
]
targets=[
'water_department',
'police_department',
]
classifier = MultinomialNB()        
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)

classifier.partial_fit(counts[:1], targets[:1],classes=numpy.unique(targets))
for c,t in zip(counts[1:],targets[1:]):
    classifier.partial_fit(c, t.split())

value.append('dogs')                                   #new value to train
targets.append('animal_department')                    #new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print counts
print targets
print vectorize.vocabulary_
####problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
####problem lies here

Even if I somehow find the index of the newly inserted value, it gives the error

ValueError: Number of features 3 does not match previous data 2.

even though I made sure to insert only one value at a time.

Upvotes: 0

Views: 157

Answers (1)

Debasis

Reputation: 3750

I will try to answer the question from a general point of view. There are two sources of problems in the Naive Bayes (NB) approach described here:

  1. Out-of-vocabulary (OOV) problem
  2. Incremental training of NB

OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3-grams. How many such 3-grams are possible? Assuming lower-casing, there are only 26 possible characters for each position, so the total number of possible character 3-grams is 26^3=17576, which is significantly lower than the number of distinct English words you're likely to see in text.

Hence, generally speaking, while training NB, a good idea is to use probabilities of character n-grams (n=3,4,5). This will drastically reduce the OOV problem.
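
As a minimal sketch (my own illustration, not code from the answer), this is one way to get character n-gram counts with scikit-learn, which the question already uses; the analyzer='char_wb' setting and the (3, 5) range are assumptions chosen for the example:

from sklearn.feature_extraction.text import CountVectorizer

# character n-grams instead of whole words; 'char_wb' extracts n-grams only
# from text inside word boundaries, which keeps the feature space small
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
counts = vectorizer.fit_transform(['drought', 'robber'])

# an unseen word such as 'droughts' still shares most of its 3-grams with
# 'drought', so it is no longer entirely out of vocabulary
print(vectorizer.transform(['droughts']).nnz)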

Incremental training: For incremental training, given a new sentence, decompose it into terms (character n-grams). Update the count of each term for its corresponding observed class label. For example, if count(t,c) denotes how many times the term t was observed in class c, simply update that count whenever you see t in class c during incremental training. Updating the counts updates the maximum likelihood probability estimates as well.
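
A rough sketch of that count-update idea, under my own assumptions (plain character 3-grams, Laplace smoothing with alpha=1.0), might look like this; it illustrates the bookkeeping described above and is not the answer's code:

from collections import defaultdict

term_counts = defaultdict(lambda: defaultdict(int))   # term_counts[c][t] = count(t, c)
class_totals = defaultdict(int)                       # total terms observed per class c

def char_ngrams(text, n=3):
    # assumption: overlapping character 3-grams of the lower-cased text
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def update(text, label):
    # incremental training step: only the counts for the observed class change
    for t in char_ngrams(text):
        term_counts[label][t] += 1
        class_totals[label] += 1

def prob(term, label, vocab_size, alpha=1.0):
    # Laplace-smoothed maximum likelihood estimate of P(term | label)
    return (term_counts[label][term] + alpha) / (class_totals[label] + alpha * vocab_size)

# feedback loop: each new labelled complaint just updates the counts,
# so a new class like 'animal_department' needs no refitting
update('drought', 'water_department')
update('robber', 'police_department')
update('dogs', 'animal_department')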

Upvotes: 1
