Jules

Reputation: 105

Sentiment analysis training set

I am using NLTK in Python to do sentiment analysis on about 200,000 reviews. To use a Naive Bayes classifier, I need a labeled training set. Since my data is not labeled, I manually labeled about 100 reviews as positive or negative. But I don't think this is the right way to do it. I have heard that I need to use 20% of the data as a training set to train the classifier and then apply it to the remaining 80%.

Is there a better way to generate a training set for a Naive Bayes classifier? Thank you for your help, and please let me know if the question is unclear.

Upvotes: 2

Views: 714

Answers (1)

Eric J.

Reputation: 150108

We have had great success using only about 100-200 training samples (depending on the specific classification) to classify hundreds of thousands of paragraphs with a fairly high degree of accuracy.

We did hand-filter the randomly selected samples to ensure they are not very similar to each other (and therefore represent different ways to express a concept). We used RapidMiner for classification rather than NLTK, but I expect the algorithms are fairly similar.
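We did that filtering by hand, but a rough automated first pass could flag near-duplicates before a human looks at them. The sketch below is only an illustration, not what we ran: it swaps in Jaccard overlap on lowercase word sets, and the function names and the 0.8 threshold are assumptions chosen for the example.

```python
def token_set(text):
    # Lowercased whitespace tokens; a real pipeline would likely tokenize more carefully.
    return set(text.lower().split())


def jaccard(a, b):
    # Share of tokens the two sets have in common (1.0 means identical sets).
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(samples, threshold=0.8):
    # Keep a sample only if it is not too similar to any sample already kept.
    # The threshold is an arbitrary starting point (assumption), not a tuned value.
    kept = []
    kept_tokens = []
    for text in samples:
        tokens = token_set(text)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(text)
            kept_tokens.append(tokens)
    return kept
```

Anything this drops is worth a quick look by eye, since word overlap alone cannot tell whether two reviews really express the same idea.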

Train your classifier on your 100 reviews, then run it against a set of 100 random reviews that are not in the training set. Check the accuracy, and add more reviews to the training set if the accuracy is not where you want it to be.
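In NLTK, that loop might look roughly like the sketch below. I have not run this against your data; the helper names, the simple bag-of-words features, and the fixed random seed are assumptions made for illustration. `nltk.NaiveBayesClassifier.train` and the classifier's `classify` method are the real NLTK calls.

```python
import random


def word_features(text):
    # Bag-of-words presence features, in the dict form NLTK classifiers expect.
    return {word: True for word in text.lower().split()}


def sample_holdout(all_reviews, training_texts, k=100, seed=0):
    # Draw up to k random reviews that are not already in the training set.
    used = set(training_texts)
    candidates = [r for r in all_reviews if r not in used]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


def train(labeled_reviews):
    # labeled_reviews: list of (text, label) pairs that you labeled by hand.
    import nltk  # imported here so the other helpers work without NLTK installed
    featuresets = [(word_features(text), label) for text, label in labeled_reviews]
    return nltk.NaiveBayesClassifier.train(featuresets)


def accuracy(classifier, labeled_reviews):
    # Fraction of hand-labeled held-out reviews the classifier gets right.
    if not labeled_reviews:
        return 0.0
    correct = sum(
        classifier.classify(word_features(text)) == label
        for text, label in labeled_reviews
    )
    return correct / len(labeled_reviews)
```

The held-out sample still has to be labeled by hand before calling `accuracy`. If the score is too low, move those newly labeled reviews into the training set, retrain, and draw a fresh sample.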

Upvotes: 1
