Classifying text documents using nltk

Question

I'm working on a project where I take emails, strip out the message bodies using the email package, and then categorize them with labels like sports, politics, technology, etc.

I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic sentiment-analysis example using the movie_reviews corpus, separating documents into positive and negative reviews.

I'm wondering how I could apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I've hit a roadblock here and am looking for a push in the right direction.

If this isn't an appropriate question for SO I'll happily delete it.

Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo: https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py

nmlq · Accepted Answer

To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:

create your own data set
use a pre-existing dataset

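The Brown Corpus (available in NLTK as nltk.corpus.brown) is a seminal collection of texts tagged by category, with many of the categories you are speaking about. It could be a starting point: using a package like gensim, you could find the corpus texts that are semantically similar to each email and use their categories to label it.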

Once your emails are labeled, you can train a classifier on them to predict a label for each unseen email.
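To illustrate that last step, here is a minimal sketch of a multi-class classifier using nltk.NaiveBayesClassifier, the same classifier used in the movie_reviews example, only with more than two labels. The training sentences, the features helper, and the categorize function are all made up for illustration; a real project would need many labeled emails per category and probably better features than a plain bag of words.

```python
import re

import nltk

# Hypothetical hand-labeled training texts (illustrative only; a real
# training set would contain many emails per category).
TRAIN = [
    ("The team won the match after scoring a late goal", "sports"),
    ("The coach praised the players after the game", "sports"),
    ("The senate passed the bill after a long vote", "politics"),
    ("The president signed the election reform law", "politics"),
    ("The new phone ships with a faster processor", "technology"),
    ("The software update fixes a security bug", "technology"),
]


def features(text):
    """Bag-of-words features: every lowercase word in the text maps to True."""
    return {word: True for word in re.findall(r"[a-z']+", text.lower())}


# NLTK classifiers train on a list of (feature_dict, label) pairs.
# Nothing restricts the labels to two classes.
train_set = [(features(text), label) for text, label in TRAIN]
classifier = nltk.NaiveBayesClassifier.train(train_set)


def categorize(text):
    """Return the most likely label for an unseen message body."""
    return classifier.classify(features(text))


if __name__ == "__main__":
    print(categorize("The players scored a goal in the match"))
    print(categorize("Congress will vote on the bill"))
```

The only change from the two-class sentiment example is the set of labels in the training data; the feature extraction and the train/classify calls stay the same.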
