asb

Reputation: 4432

nltk: Text classification using custom feature set

I have a dataset that looks like this:

featureDict = {identifier1: [[first 3-gram], [second 3-gram], ... [last 3-gram]],
               ...
               identifierN: [[first 3-gram], [second 3-gram], ... [last 3-gram]]}

Plus I have a dict of labels for the same set of documents:

labelDict = {identifier1: label1,
             ...
             identifierN: labelN}

I want to figure out the most appropriate nltk data structure for storing this information in one place, so that I can seamlessly apply the nltk classifiers to it.

Additionally, before I use any classifiers on this dataset I'd also like to apply a tf-idf filter to this feature space.

References and documentation will be helpful.

Upvotes: 1

Views: 2768

Answers (1)

Viktor Vojnovski

Reputation: 1371

You just need a simple dict. Have a look at the snippet in NLTK classify interface using trained classifier.
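For instance, you can turn the question's two dicts into the `(featureset, label)` pairs that nltk's classifiers expect. This is a minimal sketch with made-up identifiers, labels, and 3-grams standing in for the real data; the choice of a boolean "presence" feature per 3-gram is one common convention, not the only option:

```python
import nltk

# Hypothetical data shaped like the question's featureDict / labelDict:
# each document maps to a list of 3-grams, plus a parallel label dict.
featureDict = {
    "doc1": [("a", "b", "c"), ("b", "c", "d")],
    "doc2": [("x", "y", "z"), ("y", "z", "w")],
}
labelDict = {"doc1": "spam", "doc2": "ham"}

def to_featureset(ngrams):
    # nltk classifiers take a plain dict of feature name -> value;
    # here each 3-gram becomes a boolean presence feature.
    return {"contains({})".format(" ".join(g)): True for g in ngrams}

# Join the two dicts on their shared identifiers into training pairs.
train_set = [(to_featureset(featureDict[i]), labelDict[i]) for i in featureDict]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(to_featureset([("a", "b", "c")])))
```

The same `to_featureset` function is then reused at prediction time, so training and test documents are encoded identically.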

The reference documentation for this is still the nltk book: http://nltk.org/book/ch06.html and the API specs: http://nltk.org/api/nltk.classify.html

Here are some pages that might help you: http://snipperize.todayclose.com/snippet/py/Use-NLTK-Toolkit-to-Classify-Documents--5671027/, http://streamhacker.com/tag/feature-extraction/, http://web2dot5.wordpress.com/2012/03/21/text-classification-in-python/.

Also, bear in mind that nltk is limited with regard to the classifier algorithms it provides. For more advanced exploration, you'd be better off using scikit-learn.
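In scikit-learn the tf-idf weighting the question asks for falls out of a standard `Pipeline`. A sketch, assuming the documents can be reconstructed as plain strings (the texts and labels below are placeholders, and the word-level `ngram_range=(1, 3)` is just one way to get 3-gram features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical corpus standing in for the question's documents and labels.
docs = ["the cat sat", "dogs bark loudly"]
labels = ["A", "B"]

# The pipeline applies the tf-idf transform before the classifier
# ever sees the data, so the filter and the model stay in one object.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # word 1- to 3-grams
    ("clf", MultinomialNB()),
])
model.fit(docs, labels)
print(model.predict(["the cat sat"]))
```

Swapping `MultinomialNB` for any other scikit-learn classifier requires changing only the last pipeline step.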

Upvotes: 1
