SecQuestionnA

Reputation: 47

Label text documents - Supervised Machine Learning

I'm working on a project where I take emails, strip out the message bodies using the email package, and then categorize them with labels such as sports, politics, and technology. I've successfully extracted the message bodies from my emails and now want to start classifying them.
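For context, the body-extraction step could look roughly like the sketch below. It assumes the messages are plain-text or multipart with a text/plain part; `extract_body` and the sample message are hypothetical names and data used only for illustration.

```python
from email import message_from_string, policy


def extract_body(raw_email: str) -> str:
    """Return the plain-text body of a raw message, or '' if there is none."""
    # policy.default yields an EmailMessage, which provides get_body().
    msg = message_from_string(raw_email, policy=policy.default)
    # Only accept a text/plain part; HTML-only mail returns ''.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""


# Made-up sample message, for illustration only.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Match tonight\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "The football game starts at eight.\n"
)
body = extract_body(raw)
```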

To assign labels such as sports, technology, politics, and entertainment, I need a set of words for each label. For example, the sports label would have label data such as: Football, Soccer, Hockey……

Where can I find this kind of label data online?
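To make the idea concrete, here is a rough sketch of the keyword-based labelling I have in mind. The keyword sets in `LABEL_KEYWORDS` are made up for illustration (they are exactly the data I am looking for), and `label_text` is a hypothetical helper name.

```python
import re

# Hypothetical keyword sets; real label data would be far larger.
LABEL_KEYWORDS = {
    "sports": {"football", "soccer", "hockey"},
    "technology": {"software", "computer", "internet"},
    "politics": {"election", "senate", "parliament"},
}


def label_text(text, min_hits=1):
    """Return every label whose keyword set shares at least min_hits words with text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(tokens & words) for label, words in LABEL_KEYWORDS.items()}
    return sorted(label for label, hits in scores.items() if hits >= min_hits)


labels = label_text("The hockey final was streamed over the internet.")
```

A message can receive several labels this way, and a message that matches no keyword gets an empty list.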

Upvotes: 2

Views: 858

Answers (3)

oshribr

Reputation: 666

You can use DMOZ.

Be aware that different kinds of text have different vocabularies. For example, Hi or Hello will be among the most common words in email text, but they will not be common in wiki text.

Upvotes: 2

user3761001

Reputation: 128

You can use the BBC dataset. It has labeled news articles which can help.

For feature extraction, remove stopwords, apply stemming, use n-grams with tf-idf, and then choose the best features.
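A minimal, standard-library sketch of those steps might look like this. Everything in it is an assumption made for illustration: the tiny stopword list, the crude suffix-stripping stemmer, and the feature ranking by maximum tf-idf weight. That ranking is only a stand-in; with labeled data you would more likely use a label-aware selection method such as a chi-squared test, and a proper stemmer and stopword list from an NLP library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "for", "on"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, drop stopwords, then stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def ngrams(tokens, max_n=2):
    """All n-grams of length 1..max_n, joined with spaces."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, max_n=2):
    """Per-document {term: tf * idf}, with idf = log(N / document frequency)."""
    term_counts = [Counter(ngrams(tokenize(d), max_n)) for d in docs]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    weights = []
    for counts in term_counts:
        total = sum(counts.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


def top_features(weights, k=3):
    """Rank terms by their highest tf-idf weight in any document (a simple stand-in)."""
    best = Counter()
    for doc in weights:
        for term, w in doc.items():
            best[term] = max(best[term], w)
    return [term for term, _ in best.most_common(k)]


# Made-up documents, for illustration only.
docs = [
    "The team is playing football",
    "The senate is voting on the election",
    "Football fans watched the election",
]
weights = tfidf(docs)
features = top_features(weights, k=3)
```

Terms that occur in every document get a weight of zero under this idf definition, which is why greetings common to all emails would drop out of the feature set.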

Upvotes: 1

EranP

Reputation: 11

What you're trying to do is called topic modeling: https://en.wikipedia.org/wiki/Topic_model

The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this. A good place to start can be here: https://nlp.stanford.edu/software/tmt/tmt-0.4/

You can look at their topics, but you can probably also use it to assign some initial topics to your data and then build on top of those.

Upvotes: 1
