pwhc

Reputation: 516

supervised tag suggestion for documents

I have thousands of documents with associated tag information. However, I also have many documents without tags.

I want to train a model on the documents WITH tags and then apply the trained classifier to the UNTAGGED documents; the classifier will then suggest the most appropriate tags for each UNTAGGED document.

I have done quite a lot of research, and there doesn't seem to be a ready-made SUPERVISED implementation of document tag classification.

I know NLTK, gensim, word2vec and other libraries will be useful for this problem.

I will be coding the project in Python.

Any help would be greatly appreciated.

Upvotes: 0

Views: 541

Answers (2)

Diego Aguado

Reputation: 1596

I'm currently working on something similar. Besides what @Joonatan Samuel suggested, I would encourage you to preprocess carefully and keep these considerations in mind:

  1. If you want two or more tags per document, you could train several models: one model per tag. You need to consider whether there will be enough examples for each model (tag).
  2. If you have a lot of tags, many of them will have too few tagged documents to train on, which is the same problem as in point 1.
  3. Stick to predicting the most common tags; don't try to predict all tags.
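
As a rough sketch of the points above, this is one way to keep only the tags that have enough examples and build a separate binary target per tag. The toy `document_tags` list and the `MIN_EXAMPLES` threshold are made up for illustration; the right threshold depends on your data.

```python
from collections import Counter

# Toy tag lists, invented for illustration; one list per tagged document.
document_tags = [
    ["python", "pandas"],
    ["python", "numpy"],
    ["python", "flask"],
    ["java", "spring"],
    ["java"],
]

# Hypothetical threshold: a tag needs at least this many documents
# before it is worth training a model for it.
MIN_EXAMPLES = 2

# Count in how many documents each tag appears.
tag_counts = Counter(tag for tags in document_tags for tag in set(tags))

# Keep only the common tags; rare tags are dropped, not predicted.
kept_tags = sorted(tag for tag, count in tag_counts.items() if count >= MIN_EXAMPLES)

# One binary target vector per kept tag: 1 if the document has the tag, else 0.
# Each vector can then be used to train its own binary classifier.
binary_targets = {
    tag: [int(tag in tags) for tags in document_tags]
    for tag in kept_tags
}
```

Each entry of `binary_targets` is the label vector for one per-tag model, so a tag with very few 1s is a sign that its model will not have much to learn from.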

Upvotes: 0

Joonatan Samuel

Reputation: 651

Depending on your actual use case you might opt for a more complex method, but for a minimum working model do the following:

1) Preprocess the documents: tokenize and build a vocabulary (NLTK has tools for this)

2) Do bag-of-words encoding per document

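3) Train a machine learning model with the tags encoded as a binary indicator vector (one-hot if each document has exactly one tag, multi-hot if it can have several). Start with sklearn's random forest, logistic regression or SVM.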
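
A minimal sketch of the three steps with scikit-learn, assuming each document can carry several tags. The toy documents and tags are invented for illustration, `suggest_tags` is a hypothetical helper, and `CountVectorizer` stands in for the tokenize/vocabulary/bag-of-words steps (you could plug in an NLTK tokenizer instead).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only.
tagged_docs = [
    "python pandas dataframe merge columns",
    "python numpy array broadcasting",
    "java spring boot rest controller",
    "java maven build dependency",
    "python flask rest api endpoint",
    "java junit unit test assertion",
]
tagged_labels = [
    ["python", "data"],
    ["python", "data"],
    ["java", "web"],
    ["java"],
    ["python", "web"],
    ["java"],
]
untagged_docs = ["python pandas array", "java spring controller"]

# Step 3 (outputs): encode each tag list as a binary indicator row.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tagged_labels)

# Steps 1-2 (tokenize, vocabulary, bag of words) are handled by CountVectorizer;
# OneVsRestClassifier fits one logistic regression per tag.
model = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(tagged_docs, y)

# One probability per (document, tag) pair for the untagged documents.
probabilities = model.predict_proba(untagged_docs)


def suggest_tags(prob_row, top_k=2):
    """Hypothetical helper: return the top_k tags ranked by predicted probability."""
    ranked = sorted(zip(binarizer.classes_, prob_row), key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in ranked[:top_k]]


suggestions = [suggest_tags(row) for row in probabilities]
```

Ranking by probability instead of calling `predict` gives you a "suggested tags" list even when no single tag crosses the default 0.5 threshold.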

Upvotes: 1
