Genre classification of documents

I'm looking for a library, whether it's machine learning or something else, that will help me categorize the content I have. The content is a set of written articles, and I want to know which of them are about politics, sport, and so on, so that I can categorize them.

I tried OpenNLP but could not get it working the way I need. Is there anything else that would solve this?

I guess I need some kind of machine learning with natural language processing (NLP), but so far I haven't found anything that does the job.

Upvotes: 1

Answers (2)

ML_TN

Reputation: 727

I would suggest two methods; which one to use depends on your data:

First, if you already know which classes occur in your text data (e.g. sports vs. politics vs. science), you can use a supervised learning algorithm (SVM, MLP, LR, ...) trained on articles that are already labeled.
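The supervised case could look roughly like the sketch below. It assumes scikit-learn is installed, picks logistic regression (LR) as one of the options listed, and uses a tiny invented training set purely for illustration; real data would need many more labeled articles per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: each article comes with a known label.
train_texts = [
    "The team won the match after a late goal",
    "The striker scored twice in the cup final",
    "The minister announced a new election date",
    "Parliament passed the budget bill last night",
]
train_labels = ["sport", "sport", "politics", "politics"]

# Bag-of-words features feeding a linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

new_article = ["The goalkeeper saved a penalty in the match"]
predicted = model.predict(new_article)
probabilities = model.predict_proba(new_article)
print(predicted[0], dict(zip(model.classes_, probabilities[0])))
```

Swapping `LogisticRegression` for `sklearn.svm.LinearSVC` or `sklearn.neural_network.MLPClassifier` would give the SVM and MLP variants, although `LinearSVC` has no `predict_proba`.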

In the second case, where you don't know which classes you will encounter in your data, it's best to use an unsupervised learning algorithm such as LDA or LSI, which will group documents with similar topics. You then only have to examine a few documents from each group by hand and assign a label to it.
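A minimal sketch of the unsupervised route, assuming scikit-learn and its `LatentDirichletAllocation`. Note that LDA still needs a number of topics up front; `n_components=2` here is only a guess for the invented documents, and in practice you would try several values.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented, unlabeled documents.
docs = [
    "The team won the match after a late goal",
    "The striker scored twice in the cup final",
    "The minister announced a new election date",
    "Parliament passed the budget bill last night",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# n_components is an assumption; it has to be chosen or tuned by hand.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row of topic weights per document

# Assign each document to its strongest topic, then inspect a few per group.
groups = doc_topics.argmax(axis=1)
for doc, group in zip(docs, groups):
    print(group, doc)
```

The group numbers carry no meaning on their own; the manual step described above is what turns "topic 0" into "sport" or "politics".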

As for your data representation, you can use scikit-learn's or Spark's CountVectorizer to create BoW (bag-of-words) vectors to feed to your learning algorithm.

I will just add that it's best (more memory efficient and faster) to use SciPy sparse vectors if you have a big vocabulary.
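As a small illustration of both points, scikit-learn's `CountVectorizer` already returns a SciPy sparse matrix, so no extra conversion is needed. The documents below are invented.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The team won the match",
    "The minister won the election",
    "Scientists published a new study",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # rows = documents, columns = vocabulary words

print(issparse(bow))  # the matrix stores only the non-zero counts
print(bow.shape, "non-zero entries:", bow.nnz)

# A dense copy is fine for a toy example but wasteful for a large vocabulary.
dense = bow.toarray()
```

Most scikit-learn estimators accept the sparse matrix directly, so the dense copy is rarely needed.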

Upvotes: 0

Naveen Honest Raj K

Reputation: 362

This is a naive implementation, but you could improve on it. To classify a paragraph under a category, first extract the unique words from the training data of each topic.

For example, use NLTK to extract the unique words from the collection of paragraphs that talk about sports and store them in a set. Then do the same for the other topics and store those words in their own sets. Now subtract the words the sets have in common, so that each set keeps only the words that might represent its particular topic.
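The set step might be sketched as follows. To stay dependency-free it uses a simple regex tokenizer as a stand-in for NLTK (for example `nltk.word_tokenize`), and it reads "subtract the common words" as dropping any word that also appears in another topic; both choices are assumptions, as are the example paragraphs.

```python
import re

def tokenize(text):
    # Stand-in for an NLTK tokenizer: lowercase alphabetic words only.
    return re.findall(r"[a-z']+", text.lower())

def topic_specific_words(docs_by_topic):
    # Unique words per topic.
    vocab = {
        topic: {word for doc in docs for word in tokenize(doc)}
        for topic, docs in docs_by_topic.items()
    }
    specific = {}
    for topic, words in vocab.items():
        # Everything that occurs in any other topic counts as "common".
        others = set().union(*(v for t, v in vocab.items() if t != topic))
        specific[topic] = words - others
    return specific

docs_by_topic = {
    "sport": ["The team won the match", "The striker scored a goal"],
    "politics": ["The minister won the election", "Parliament passed the bill"],
}
specific = topic_specific_words(docs_by_topic)
print(specific)
```

Words such as "the" and "won" disappear because both topics use them, while filler like "a" can survive by accident; a stop-word list would clean that up.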

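The goal is that, when you input a paragraph, the model gives you a one-hot style output over the topics. To build the input, combine the unique words of all topics into a single list.

Then, when you analyze a paragraph, put a 1 at the position of every word from that list that you find in it, and a 0 everywhere else.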

For example, after analysing your first paragraph, you might get the result

[ 0, 0, 1, 0, 1, .... 1, 0, 0] -> here a 1 at position 3 means that the third word of the list was found, and so on.
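A short sketch of turning a paragraph into such a 0/1 vector. The six-word `vocabulary` and the paragraph are invented for illustration, and the regex tokenizer is again a stand-in for NLTK.

```python
import re

def tokenize(text):
    # Stand-in for an NLTK tokenizer: lowercase alphabetic words only.
    return re.findall(r"[a-z']+", text.lower())

def to_binary_vector(paragraph, vocabulary):
    # 1 where the vocabulary word occurs in the paragraph, 0 otherwise.
    present = set(tokenize(paragraph))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical combined list of topic-specific words, in a fixed order.
vocabulary = ["bill", "election", "goal", "match", "minister", "team"]

vector = to_binary_vector("The team scored a late goal in the match", vocabulary)
print(vector)  # [0, 0, 1, 1, 0, 1]
```

The order of `vocabulary` must stay the same for training and prediction, otherwise the positions no longer refer to the same words.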

So your training data will have this vector as input and the topic as a one-hot encoded output, i.e. if you have three categories and your first paragraph belongs to the first topic, the expected output will be [1,0,0].
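The one-hot target can be built in a couple of lines; the three topic names here are just placeholders.

```python
def one_hot(topic, topics):
    # 1 at the index of the paragraph's topic, 0 elsewhere.
    return [1 if t == topic else 0 for t in topics]

topics = ["sport", "politics", "science"]  # fixed order, chosen once

first_target = one_hot("sport", topics)
print(first_target)  # [1, 0, 0]
```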

Collect many such inputs and outputs to train on, and then test the model with new inputs. It will give the highest probability to the topic that fits best.

You can train it with a basic neural network and a normal softmax (cross-entropy) loss function. This might take you just an hour to do.
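As a rough sketch of the training step, the code below implements the simplest possible case: a single softmax layer (no hidden layer) trained with plain gradient descent on the cross-entropy loss, in pure Python. The four-word feature vectors, the learning rate, and the epoch count are all invented for illustration; a real setup would more likely use a library such as Keras or PyTorch and a hidden layer.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, weights, bias):
    # weights[k][j] is the weight of feature j for class k.
    logits = [sum(xi * w for xi, w in zip(x, row)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def train(X, Y, epochs=200, lr=0.5):
    n_features, n_classes = len(X[0]), len(Y[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(X, Y):
            probs = predict_proba(x, weights, bias)
            for k in range(n_classes):
                err = probs[k] - y[k]  # gradient of cross-entropy w.r.t. logit k
                grad_b[k] += err
                for j in range(n_features):
                    grad_w[k][j] += err * x[j]
        for k in range(n_classes):
            bias[k] -= lr * grad_b[k] / len(X)
            for j in range(n_features):
                weights[k][j] -= lr * grad_w[k][j] / len(X)
    return weights, bias

# Features: [team, goal, minister, election]; targets: [sport, politics].
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

weights, bias = train(X, Y)
probs = predict_proba([1, 1, 0, 0], weights, bias)
print(probs)  # the first entry (sport) should be the larger one
```

The class with the highest probability is the predicted topic, which matches the "higher probability on the topic it fits" described above.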

All the best.

Upvotes: 1
