Cody

Reputation: 484

Classifying text documents using nltk

I'm currently working on a project where I take emails and strip out the message bodies using the email package. I then want to categorize them using labels like sports, politics, technology, etc.

I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic sentiment-analysis example using the movie_reviews corpus, separating documents into positive and negative reviews.

I'm just wondering how I could apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I have hit a roadblock here and am looking for a push in the right direction.

If this isn't an appropriate question for SO I'll happily delete it.

Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project. Here is a link to the code in the project's GitHub repo: https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py

Upvotes: 2

Views: 5786

Answers (2)

bogs

Reputation: 2296

Text classification is a supervised machine learning problem, which means you need labelled data. When you approached the movie_reviews problem, you used the positive/negative (+1/-1) labels to train your sentiment analysis system.

Getting back to your problem:

  1. If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification

  2. If you don't have labels, you can try an unsupervised learning approach. If you have any idea how many categories you have (call the number K), you can try KMeans. This groups the emails into K clusters based on how similar they are, so similar emails end up in the same bucket. Then inspect the clusters by hand and come up with a label for each. Assign new emails to the most similar cluster. If you need help with KMeans check this quick recipe: Text Clustering Recipe

Suggestion: Getting labels for emails can be easier than you think. For example, Gmail lets you export your emails with folder information. If you have categorised your email, you can take advantage of this.
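As a rough sketch of how exported labels could be read, the function below uses Python's standard mailbox module. It assumes the export is an mbox file and that each message carries its labels as a comma-separated `X-Gmail-Labels` header; that header name is an assumption about the export format, so check it against your own file. The function name `labelled_subjects` is made up for this example.

```python
import mailbox


def labelled_subjects(mbox_path):
    """Yield (label, subject) pairs for every label on every message.

    Assumes labels live in a comma-separated 'X-Gmail-Labels' header;
    messages without that header are skipped.
    """
    for message in mailbox.mbox(mbox_path):
        raw_labels = str(message.get("X-Gmail-Labels", ""))
        subject = str(message.get("Subject", ""))
        for label in raw_labels.split(","):
            label = label.strip()
            if label:
                yield label, subject
```

Calling `list(labelled_subjects("export.mbox"))` on a hypothetical export file would give you label/subject pairs. In practice you would pair each label with the message body you already extract with the email package instead of the subject.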

Upvotes: 3

nmlq

Reputation: 3154

To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:

  1. create your own data set
  2. use a pre-existing dataset

The Brown corpus is a seminal collection of categorized texts that covers many of the categories you are speaking about. It could be a starting point for classifying your emails, using a package like gensim to find semantically similar texts.

Once you classify your emails, you can then train a system to predict a label for each unseen email.
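That last step might look like the sketch below once you have labelled examples. It uses scikit-learn with TF-IDF features and a multinomial Naive Bayes classifier, which is one common choice and not something the steps above prescribe. The training texts and labels are toy data invented for illustration. Note that nothing limits you to two classes: the classifier learns however many distinct labels appear in the training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled emails (hypothetical data); real training needs many more.
train_texts = [
    "The team won the match with a late goal",
    "The striker scored twice in the cup final",
    "Parliament passed the budget after a long vote",
    "The senator announced her election campaign",
    "The new phone has a faster processor and more memory",
    "The software update fixes a security bug in the browser",
]
train_labels = [
    "sports", "sports",
    "politics", "politics",
    "technology", "technology",
]

# Vectorize the text and train a multi-class classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict a label for an unseen email.
predicted = model.predict(["This laptop has a faster processor"])
print(predicted[0])
```

Keeping the vectorizer and classifier in one pipeline means new emails are transformed with the same vocabulary that was learned during training.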

Upvotes: 0
