Reputation: 51
I am a newbie in NLP, just doing it for the first time. I am trying to solve a problem.
My problem is I have some documents which are manually tagged like:
doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
.
.
.
.
docN - categoryX
Here I have a fixed set of categories and any document can have any number of tags associated with it. I want to train the classifier using this input, so that this tagging process can be automated.
Thanks
Upvotes: 5
Views: 5730
Reputation: 90
its not clear what you have tried or what programming language you are using but as most have suggested try text classification with document vectors, bag of words (as long as there are words in the documents that can help with classification)
Here are some simple tools that can help get you started
Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
Upvotes: 0
Reputation: 5407
Most of classifier works on Bag of word model . There are multiple use case to get expected result.
Try out most general Multinomial naive base classifer with changing different input paramters and check out result.
Try variants of ML Naive base (http://scikit-learn.org/0.11/modules/naive_bayes.html)
You can check out sentence classifier along with considering sentence structures. Considering ngram concepts, you can try out with 2,3,4,5 gram models and check how result varies. Count vectorizer allows ngram, check out this link for example - http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Based on dataset features, not a single classifier can be best for you scenario, you have to check out different use case, which fits best for you.
Most initial approach is, you get started with simple classifier using scikit learn.
Put each category as traning class and train the classifier with this classes
For any input docX, classifier with trained model
threshold
like probability different between three most highest resulting category, if it matches the threshold consider those category as result for that input class.Upvotes: 3
Reputation: 8225
What you are trying to do is called multi-way supervised text categorization (or classification). Knowing the right question to ask is half the problem.
As for how this can be done, here are two references:
Upvotes: 4