Reputation: 81
I'm building an NLTK classifier that predicts a sentence's category.
I have already trained the classifier using the featuresets of 11,000 sentences:
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = naivebayes.NaiveBayesClassifier.train(train_set)
Now I want to add more (sentence, category) featuresets to improve the classifier. The only way I know is to append the new featuresets to the list of already learned featuresets and train a new classifier from scratch. That seems inefficient, because retraining takes a long time even when I add only one or a few sentences.
Is there a good way to improve the classifier's quality by adding featuresets?
Upvotes: 0
Views: 186
Reputation: 16114
Two things.
1. Naive Bayes is usually very fast. It visits all your training data only once and accumulates the feature-class co-occurrence statistics. After that, it uses those statistics to build the model. Usually it's not a problem to simply re-train your model with the new (incremental) data.
2. It's possible to avoid redoing the steps above when new data arrives, as long as you still have the feature-class statistics stored somewhere. You visit only the new data, the same way as in step 1, and keep updating the feature-class co-occurrence statistics. At the end of the day, you have new numerators (m) and denominators (n), which apply to both the class priors P(C) and the probability of a feature given a class P(W|C). You can derive the probabilities as m/n.
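A minimal sketch of the second point, in plain Python rather than NLTK's own API. The class name `IncrementalNBCounts` and its methods are hypothetical, made up for illustration. It assumes NLTK-style (featureset dict, label) pairs in which every featureset carries the same feature names, so the class count can serve as the denominator for P(W|C). No smoothing is applied.

```python
from collections import Counter, defaultdict


class IncrementalNBCounts:
    """Hypothetical helper: keeps the raw counts (the m's and n's) so that
    new labeled data can be folded in without revisiting old data."""

    def __init__(self):
        self.class_counts = Counter()               # numerators for P(C)
        self.feature_counts = defaultdict(Counter)  # numerators for P(W|C), per class
        self.total = 0                              # denominator for P(C)

    def update(self, labeled_featuresets):
        """Visit only the given (featureset, label) pairs and add to the counts."""
        for features, label in labeled_featuresets:
            self.class_counts[label] += 1
            self.total += 1
            for name, value in features.items():
                self.feature_counts[label][(name, value)] += 1

    def prior(self, label):
        """P(C) = m / n."""
        return self.class_counts[label] / self.total

    def likelihood(self, name, value, label):
        """P(W|C) = m / n (unsmoothed; assumes every featureset has this feature)."""
        return self.feature_counts[label][(name, value)] / self.class_counts[label]


counts = IncrementalNBCounts()

# Initial training pass.
counts.update([
    ({"contains(goal)": True}, "sports"),
    ({"contains(goal)": False}, "politics"),
])

# Later, one new labeled sentence arrives: only it is visited.
counts.update([({"contains(goal)": True}, "sports")])

print(counts.prior("sports"))
print(counts.likelihood("contains(goal)", True, "sports"))
```

The counts are what you would persist (for example with pickle) between updates; the probabilities are cheap to recompute from them whenever needed.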
A friendly reminder of the Bayesian formulas in document classification:
-- Given a document D, the probability that the document falls in category C_j is:
P(C_j|D) = P(D|C_j) * P(C_j) / P(D)
-- That probability is proportional to:
P(C_j|D) ~ P(W1|C_j) * P(W2|C_j) * ... * P(Wk|C_j) * P(C_j)
based on:
the naive Bayes assumption (all words W1, W2, ..., Wk in the doc are independent given the class), and
dropping P(D), because every class has the same P(D) as denominator (thus we say proportional to, not equal to).
-- Now all probabilities on the right side can be computed by a corresponding fraction (m/n), where m and n are stored in (or can be derived from) the feature-class co-occurrence matrix.
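To make the proportionality formula concrete, here is a small sketch that scores a document directly from stored counts. The counts, the names `class_docs`, `word_counts`, `log_score` and `classify`, and the use of add-one smoothing and log-probabilities are all assumptions added for illustration; they are not part of the formulas above, but they avoid zero probabilities and numeric underflow.

```python
import math
from collections import Counter

# Hypothetical stored counts:
# class_docs[c]     = number of training documents in class c
# word_counts[c][w] = number of times word w was seen in class c
class_docs = Counter({"sports": 3, "politics": 2})
word_counts = {
    "sports": Counter({"goal": 4, "team": 3, "vote": 1}),
    "politics": Counter({"vote": 5, "team": 1}),
}
vocabulary = {"goal", "team", "vote"}


def log_score(words, label):
    """log P(C_j) + sum_i log P(Wi|C_j); every probability is a fraction m/n."""
    score = math.log(class_docs[label] / sum(class_docs.values()))
    # Add-one smoothed denominator (an assumption, not in the original formula).
    n = sum(word_counts[label].values()) + len(vocabulary)
    for w in words:
        m = word_counts[label][w] + 1  # add-one smoothed numerator
        score += math.log(m / n)
    return score


def classify(words):
    """Pick the class with the highest score; P(D) is never needed."""
    return max(class_docs, key=lambda label: log_score(words, label))


print(classify(["goal", "team"]))  # sports
print(classify(["vote"]))          # politics
```

Because only the counts appear in the computation, updating them with new data immediately changes the classifier's decisions without any separate training step.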
Upvotes: 1