user1765949

Reputation: 81

NLTK: Document Classification with numeric score instead of labels

For a project, I've been experimenting with Python's NLTK, document classification, and the Naive Bayes classifier. As I understand from the documentation, this works very well if your documents are tagged with a label such as pos or neg (or with more than two labels).

The documents I'm working with are already classified, but instead of labels they have a score: a floating-point number between 0 and 5.

What I would like to do is build a classifier, like the movies example in the documentation, that predicts the score of a piece of text rather than a label. I believe this is mentioned in the docs as 'probabilities of numeric features' but never explored further.

I am neither a language expert nor a statistician, so if someone has an example of this lying around, I would be most grateful if you would share it with me. Thanks!

Upvotes: 8

Views: 1317

Answers (2)

Ethan Herdrick

Reputation: 303

This is a very late answer, but perhaps it will help someone.

What you're asking about is regression. Regarding Jacob's answer, linear regression is only one way to do it. However, I agree with his recommendation of scikit-learn.
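To illustrate a regressor other than linear regression, here is a minimal sketch using scikit-learn's `KNeighborsRegressor` on TF-IDF features. The toy texts, the scores, and the choice of two neighbours are made up for the example and are not from the question; treat it as a starting point rather than a tuned solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each text is paired with a score between 0 and 5.
train_texts = [
    "terrible plot and awful acting",
    "boring and far too long",
    "decent film with a few good scenes",
    "great cast and a wonderful story",
    "excellent, a wonderful masterpiece",
]
train_scores = [0.5, 1.0, 3.0, 4.5, 5.0]

# Turn each text into a TF-IDF vector, then predict a score by averaging
# the scores of the 2 most similar training texts (assumed value of k).
model = make_pipeline(TfidfVectorizer(), KNeighborsRegressor(n_neighbors=2))
model.fit(train_texts, train_scores)

predicted = model.predict(["a wonderful story with a great cast"])
print(predicted[0])
```

Because each prediction is an average of training scores, it always stays within the range of the scores seen during training.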

Upvotes: 0

Jacob

Reputation: 4182

What you're looking for is linear regression, and scikit-learn is much better suited to this than NLTK. See http://scikit-learn.org/stable/modules/linear_model.html
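A rough sketch of what this could look like with scikit-learn, assuming TF-IDF features and `Ridge` (a regularized linear regression from the `linear_model` module). The toy texts and scores are invented for illustration, and `alpha=1.0` is just the library default, not a recommended setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each text is paired with a score between 0 and 5.
train_texts = [
    "terrible plot and awful acting",
    "boring and far too long",
    "decent film with a few good scenes",
    "great cast and a wonderful story",
    "excellent, a wonderful masterpiece",
]
train_scores = [0.5, 1.0, 3.0, 4.5, 5.0]

# Vectorize the text with TF-IDF, then fit a regularized linear model
# that maps the word weights directly to a numeric score.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(train_texts, train_scores)

predicted = model.predict(["a wonderful story with a great cast"])

# A linear model can predict values outside the training range,
# so clip the result to the valid 0-5 interval.
clipped = min(5.0, max(0.0, float(predicted[0])))
print(clipped)
```

With real data you would hold out part of the documents and check the error (for example with `sklearn.metrics.mean_absolute_error`) before trusting the predictions.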

Upvotes: 1
