Reputation: 47
I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then categorize them with labels like sports, politics, technology, etc. I've successfully extracted the message bodies from my emails, and I'm now looking to start classifying.
To apply multiple labels such as sports, technology, politics, and entertainment, I need a set of words for each label. For example, the Sports label would have the label data: Football, Soccer, Hockey...
Where can I find label data online to help me?
Upvotes: 2
Views: 858
Reputation: 666
You can use DMOZ.
Be aware that different kinds of text have different vocabularies. For example, one of the most common words in email text will be Hi or Hello, but in wiki text Hi and Hello will not be common words.
Upvotes: 2
Reputation: 128
You can use the BBC dataset. It has labeled news articles which can help.
For feature extraction, remove stopwords, apply stemming, use n-grams with tf-idf, and then choose the best features.
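The steps above can be sketched roughly as follows. This is only a minimal, standard-library illustration, not a recommended implementation: the stopword list, the sample documents, and the helper names (`tokenize`, `ngrams`, `tfidf`, `top_features`) are all made up for the example, stemming is left out (a library stemmer such as NLTK's `PorterStemmer` could be applied inside `tokenize`), and "best features" is approximated here by simply taking the highest tf-idf weights.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
# Greetings are included because they are frequent in email text.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "hi", "hello"}

def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    term_lists = [ngrams(tokenize(d), n_max) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for terms in term_lists for t in set(terms))
    n_docs = len(docs)
    weights = []
    for terms in term_lists:
        tf = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty documents
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights

def top_features(doc_weights, k=3):
    """Pick the k highest-weighted terms of one document (ties broken by name)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical sample email bodies.
docs = [
    "Hi, the football match ended in a draw",
    "Hello, the election results are in",
    "Hi, the football league signed a new sponsor",
]
scores = tfidf(docs)
for doc_scores in scores:
    print(top_features(doc_scores))
```

In practice a library would usually do this for you; for instance, scikit-learn's `TfidfVectorizer` accepts `stop_words` and `ngram_range` parameters, and its weights could then be fed to a classifier trained on a labeled corpus such as the BBC dataset.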
Upvotes: 1
Reputation: 11
What you're trying to do is called topic modeling: https://en.wikipedia.org/wiki/Topic_model
The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this. A good place to start can be here: https://nlp.stanford.edu/software/tmt/tmt-0.4/
You can look at their topics, but you can probably also use the tool to assign some initial topics to your data and then build on top of those.
Upvotes: 1