Reputation: 516
I have thousands of documents with associated tag information. However, I also have many documents without tags.
I want to train a model on the documents WITH tags and then apply the trained classifier to the UNTAGGED documents; the classifier will then suggest the most appropriate tags for each UNTAGGED document.
I have done quite a lot of research, but I can't find a SUPERVISED implementation of document tag classification.
I know NLTK, gensim, word2vec and other libraries will be useful for this problem.
I will be coding the project in Python.
Any help would be greatly appreciated.
Upvotes: 0
Views: 541
Reputation: 1596
I'm currently working on something similar. Besides what @Joonatan Samuel suggested, I would encourage you to preprocess the documents carefully and think through your design choices.
Upvotes: 0
Reputation: 651
Depending on your actual use case you might opt for a more complex method, but for a minimum working model do the following:
1) Preprocess the documents: tokenize and build a vocabulary (NLTK has tools for this)
2) Do bag-of-words encoding per document
3) Train a machine learning model with one-hot encoding for the outputs (one binary column per tag). Start with sklearn's random forest, logistic regression, or SVM.
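The three steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned solution. It assumes each document can carry several tags, so the outputs are encoded with sklearn's MultiLabelBinarizer (one binary column per tag), and it swaps in CountVectorizer's built-in tokenizer for NLTK to keep the example self-contained. The toy corpus, the tag names and the suggest_tags helper are all made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy corpus: every tagged document comes with a list of tags.
tagged_docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "neural network training gradient descent",
    "deep neural network gradient backpropagation",
    "python neural network library",
    "python script for training a neural network",
]
doc_tags = [
    ["python"],
    ["python"],
    ["ml"],
    ["ml"],
    ["python", "ml"],
    ["python", "ml"],
]
untagged_docs = [
    "python numpy array slicing",
    "gradient descent for a neural network",
]

# Encode the tag lists as a binary indicator matrix, one column per tag.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(doc_tags)

# Steps 1 and 2: CountVectorizer tokenizes, builds the vocabulary and
# produces bag-of-words counts.
# Step 3: OneVsRestClassifier fits one logistic regression per tag.
model = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(tagged_docs, Y)


def suggest_tags(docs, top_k=2):
    """Return the top_k (tag, probability) pairs for each document."""
    probs = model.predict_proba(docs)
    suggestions = []
    for row in probs:
        ranked = sorted(
            zip(mlb.classes_, row), key=lambda pair: pair[1], reverse=True
        )
        suggestions.append([(str(tag), float(p)) for tag, p in ranked[:top_k]])
    return suggestions


suggestions = suggest_tags(untagged_docs)
for doc, tags in zip(untagged_docs, suggestions):
    print(doc, "->", tags)
```

Ranking tags by predicted probability, rather than taking hard 0/1 predictions, fits the "suggest the most appropriate tags" use case: you can show the top few candidates or apply your own threshold. Swapping LogisticRegression for a random forest or a linear SVM only changes the estimator inside OneVsRestClassifier, although an SVM without probability estimates would need decision_function instead of predict_proba.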
Upvotes: 1