Assign document to a category using document similarity

Question

I'm developing an NLP project in Python.

I'm collecting "conversations" from social networks. A conversation is made up of post_text + comment_text + reply_text (comment_text and reply_text are optional).

I also have a list of categories (arguments), and I want to "connect" each conversation to an argument (or get a weight for each argument).

For each category, I fetch its Wikipedia summary using the wikipedia Python package. So these summaries are my training documents (right?).

I've written down some steps to follow, but maybe I'm wrong.

Each training document must be transformed into a vector space model. I have to remove stop words and other very common words, which leaves me with a vocabulary list.
Each conversation must also be transformed into a vector space model, with each token mapped to its vocabulary index. I can store all of these vectors as rows of a matrix.
Next, I have to apply tf-idf (for example) to all matrix rows.
- For tf-idf, do I have to calculate tf and idf and then normalize the matrix?
Each row then holds the tf-idf weights for one conversation. Finally, I have to compute cosine similarity (for example) between each conversation and a training document, and repeat this for every training document.
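
The steps above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not a definitive implementation. The stop-word list, the two category texts, the sample conversation, and all function names are made up for the example, and the smoothed idf formula is one common variant among several.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "to",
             "for", "on", "with", "by", "this", "that", "are", "my"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def fit_idf(documents):
    """Build the vocabulary and idf weights from the training documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Return a sparse, L2-normalized tf-idf vector as {term: weight}."""
    # Tokens outside the training vocabulary are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def category_weights(conversation, category_docs):
    """Score one conversation against every category document."""
    idf = fit_idf(list(category_docs.values()))
    conv_vec = tfidf_vector(conversation, idf)
    return {name: cosine(conv_vec, tfidf_vector(doc, idf))
            for name, doc in category_docs.items()}


# Made-up stand-ins for the Wikipedia summaries and one conversation.
category_docs = {
    "football": "Football is a team sport played with a ball between two teams of eleven players.",
    "cooking": "Cooking is the art of preparing food for eating, often by heating ingredients.",
}
conversation = "Great match last night! Both teams played well and the ball control was superb."

weights = category_weights(conversation, category_docs)
best = max(weights, key=weights.get)
print(best, weights)
```

Because every vector is L2-normalized, cosine similarity reduces to a dot product, so normalization happens once per vector and the result is a weight per category. In practice, scikit-learn's TfidfVectorizer (which by default uses a smoothed idf and L2 normalization) together with cosine_similarity from sklearn.metrics.pairwise covers the same steps.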

What do you think of these steps? Is there a guide, how-to, or book I should read to understand this problem better?
