Federico Cuozzo

Reputation: 361

Assign document to a category using document similarity

I'm developing an NLP project in Python.

I'm collecting "conversations" from social networks. A conversation is made up of post_text + comment_text + reply_text (comment_text and reply_text are optional).

I also have a list of categories (arguments), and I want to "connect" each conversation to an argument (or get a weight for each argument).

For each category, I get the summary from Wikipedia, using the wikipedia Python package. So these summaries are my training documents (right?).
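To make the idea concrete, here is a minimal sketch of the similarity approach, assuming plain bag-of-words counts and cosine similarity. The summaries below are hypothetical stand-ins for the Wikipedia text, and the helper names (tokenize, cosine_similarity, category_weights) are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def category_weights(conversation, summaries):
    """Return a {category: similarity} dict for one conversation."""
    conv_vec = Counter(tokenize(conversation))
    return {
        category: cosine_similarity(conv_vec, Counter(tokenize(summary)))
        for category, summary in summaries.items()
    }


# Hypothetical stand-ins for the Wikipedia summaries (not real data).
summaries = {
    "football": "football is a team sport played with a ball between two teams",
    "cooking": "cooking is the art of preparing food with heat and ingredients",
}

# post_text + comment_text + reply_text joined into a single string.
conversation = " ".join(
    ["Great match yesterday", "the team played well", "what a ball"]
)

weights = category_weights(conversation, summaries)
best = max(weights, key=weights.get)
```

Here weights holds one score per category and best is the highest-scoring one. Raw counts let common words like "the" and "a" inflate the scores, so stop-word removal or TF-IDF weighting would probably be needed on real data.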

Now, I've written down some steps to follow, but maybe I'm wrong.

What do you think about the steps? Is there any guide, how-to, or book I should read to understand this problem better?

Upvotes: 0

Views: 199

Answers (1)

Azad

Reputation: 71

Instead of matching conversations against the Wikipedia summaries by similarity, you can train a classifier that, given a summary, predicts which category it belongs to. Start with the simplest option, a bag-of-words representation of the Wikipedia summaries, then analyse the results and the accuracy. After that, you can move on to more sophisticated representations, such as word-to-vector or document-to-vector embeddings (for example word2vec or doc2vec), and train a classifier on those.

Once the classification model is trained, you assign a category to a test document by classifying it with that model.
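As a rough sketch of the bag-of-words route, the snippet below implements a small multinomial naive Bayes classifier using only the standard library. The choice of naive Bayes, the class and method names, and the training sentences are all assumptions made for illustration; in practice you would more likely use a library such as scikit-learn and the real Wikipedia summaries.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return {label: unnormalised log-probability} for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen during training carry no information; skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training texts standing in for the Wikipedia summaries.
train_docs = [
    "football is a team sport played with a ball",
    "the team scored a goal in the match",
    "cooking is preparing food with heat",
    "the recipe needs fresh ingredients and an oven",
]
train_labels = ["football", "football", "cooking", "cooking"]

model = NaiveBayesClassifier().fit(train_docs, train_labels)
sports_prediction = model.predict("what a match, the team played well")
food_prediction = model.predict("fresh food from the oven")
```

The per-label values from log_scores can also serve as a weight for each category rather than a single hard label, which is closer to what the question asks for.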

Upvotes: 1
