Reputation: 81
In the light of a project I've been playing with Python NLTK and Document Classification and the Naive Bayes classifier. As I understand from the documentation, this works very well if your different documents are tagged with either pos or neg as a label (or more than 2 labels)
The documents I'm working with that are already classified don't have labels, but they have a score, a floating point between 0 and 5.
What I would like to do is build a classifier, like the movies example in the documentation, but that would predict the score of a piece of text, rather than the label. I believe this is mentioned in the docs but never further explored as 'probabilities of numeric features'
I am not a language expert nor a statistician so if someone has an example of this lying around I would be most grateful if you would share this with me. Thanks!
Upvotes: 8
Views: 1317
Reputation: 303
This is a very late answer, but perhaps it will help someone.
What you're asking about is regression. Regarding Jacob's answer, linear regression is only one way to do it. However, I agree with his recommendation of scikit-learn.
Upvotes: 0
Reputation: 4182
What you're looking for is linear regression, and scikit-learn is much better than NLTK for this, see http://scikit-learn.org/stable/modules/linear_model.html
Upvotes: 1