Reputation: 4099
I have a theoretical question about a Naive Bayes Classifier. Assume I have trained the classifier with the following training data:
class  word   count
-------------------
pos    good   1
       sun    1
neu    tree   1
neg    bad    1
       sad    1
Assume I now classify "good sun great". There are two options:
1) Classify against the training data, which remains static. Both "good" and "sun" come from the positive category, so the string is classified as positive. After classification the training table remains unchanged; every subsequent string is classified against the same static training set.
2) Classify the string, but then update the training data, as in the table underneath (see the sketch after that table). The next string will thus be classified against a more "advanced" training set than this one. By the end of (automatic) classification, the table that started out as a simple training set will have grown considerably, expanded with many new words and updated word counts.
class  word   count
-------------------
pos    good   2
       sun    2
       great  1
neu    tree   1
neg    bad    1
       sad    1
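
For concreteness, here is a minimal sketch of what both options look like in code. It assumes a multinomial Naive Bayes with add-one smoothing; the NaiveBayes class and its methods are hypothetical illustrations, not any particular library's API, and the class prior is crudely estimated from word totals rather than document counts:

    from collections import defaultdict
    import math

    class NaiveBayes:
        def __init__(self):
            self.word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
            self.class_totals = defaultdict(int)                      # class -> total word count
            self.vocab = set()

        def update(self, label, words):
            # Add the words to the counts for this class. Used for the
            # initial training data, and in option 2 also after each
            # classification, feeding predictions back into the table.
            for w in words:
                self.word_counts[label][w] += 1
                self.class_totals[label] += 1
                self.vocab.add(w)

        def classify(self, words):
            # argmax over classes of log P(c) + sum over words of log P(w|c),
            # with add-one (Laplace) smoothing for unseen words.
            total = sum(self.class_totals.values())
            vocab_size = len(self.vocab)
            best_label, best_score = None, float("-inf")
            for c in self.class_totals:
                score = math.log(self.class_totals[c] / total)
                for w in words:
                    score += math.log((self.word_counts[c][w] + 1) /
                                      (self.class_totals[c] + vocab_size))
                if score > best_score:
                    best_label, best_score = c, score
            return best_label

    nb = NaiveBayes()
    nb.update("pos", ["good", "sun"])
    nb.update("neu", ["tree"])
    nb.update("neg", ["bad", "sad"])

    label = nb.classify(["good", "sun", "great"])  # option 1: counts stay static
    # nb.update(label, ["good", "sun", "great"])   # option 2: feed the prediction back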
In my implementation of Naive Bayes I used the first method, but I'm now second-guessing myself and wondering whether I should have used the latter. Please enlighten me :-)
Upvotes: 0
Views: 917
Reputation: 3032
The method you've implemented is indeed the popular and accepted way of building classifiers (and not just Bayesian ones).
Using "unlabeled" data, i.e. data you have no ground-truth about, to update the classifier, is a more advanced and complicated technique, sometimes called "semi-supervised learning". Using this class of algorithms might or might not be a good fit to your specific task - it's usually a matter of trial and error.
If you do decide to incorporate unlabeled data into your model, you should probably try one of the popular algorithms for doing that, e.g. EM (Expectation-Maximization).
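
As a rough illustration of how that differs from option 2: a simple self-training loop (a hard-EM variant) retrains from scratch each round and revises its own pseudo-labels, instead of committing each prediction to the counts permanently. This sketch reuses the hypothetical NaiveBayes class from the question:

    def self_train(labeled, unlabeled, rounds=5):
        # labeled:   list of (label, words) pairs with ground truth
        # unlabeled: list of word lists without ground truth
        # Alternate between retraining on labeled + pseudo-labeled data
        # (M-step) and re-labeling the unlabeled data (E-step). Unlike
        # option 2, earlier pseudo-labels are revised each round rather
        # than accumulating permanently in the counts.
        pseudo = []
        for _ in range(rounds):
            nb = NaiveBayes()
            for label, words in labeled + pseudo:
                nb.update(label, words)
            pseudo = [(nb.classify(words), words) for words in unlabeled]
        # final M-step on the last set of pseudo-labels
        nb = NaiveBayes()
        for label, words in labeled + pseudo:
            nb.update(label, words)
        return nb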
Upvotes: 1