user1168811

Reputation: 51

NLP text tagging

I am new to NLP and working on my first problem.

I have some documents that have been manually tagged, like this:

doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
.
.
.
.
docN - categoryX

I have a fixed set of categories, and any document can have any number of tags associated with it. I want to train a classifier on this input so that the tagging process can be automated.

Thanks

Upvotes: 5

Views: 5730

Answers (3)

ryder1211212

Reputation: 90

It's not clear what you have tried or which programming language you are using, but as most have suggested, try text classification with document vectors or a bag-of-words model (as long as the documents contain words that can help with classification).

Here are some simple tools that can help you get started:

Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
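
To make "bag of words" concrete, here is a minimal sketch in plain Python. The two example documents and the helper names (`bag_of_words`, `to_vector`) are made up for illustration; the libraries above do the same thing with better tokenization.

```python
from collections import Counter

# Hypothetical example documents, invented for this sketch.
docs = {
    "doc1": "the team won the match",
    "doc2": "the minister won the vote",
}

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())

# The vocabulary is every distinct word seen in any document, in sorted order.
vocabulary = sorted({word for text in docs.values() for word in text.lower().split()})

def to_vector(text):
    """Turn a text into a list of counts, one entry per vocabulary word."""
    counts = bag_of_words(text)
    return [counts[word] for word in vocabulary]

vectors = {name: to_vector(text) for name, text in docs.items()}
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length vector of word counts, which is the kind of input a classifier is trained on. Word order is discarded, which is why this only helps when the words themselves indicate the category.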

Upvotes: 0

user123

Reputation: 5407

Most classifiers work on a bag-of-words model. There are several approaches you can try to get the expected result.

  1. Try the general-purpose multinomial Naive Bayes classifier, varying its input parameters and checking the results.

  2. Try the other Naive Bayes variants (http://scikit-learn.org/0.11/modules/naive_bayes.html).

  3. You can try a sentence classifier that takes sentence structure into account. Using n-grams, you can try 2-, 3-, 4-, and 5-gram models and see how the results vary. CountVectorizer supports n-grams; see this tutorial for an example: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
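
As a rough sketch of points 1 and 3, the snippet below trains a multinomial Naive Bayes model on unigram and bigram counts with scikit-learn. The training texts and labels are invented for illustration, and `ngram_range=(1, 2)` and `alpha=1.0` are just starting values to vary, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one label per document, for simplicity.
train_texts = [
    "the match ended with a late goal",
    "the team won the football match",
    "the parliament passed the new budget",
    "the minister announced a tax reform",
]
train_labels = ["sports", "sports", "politics", "politics"]

# ngram_range=(1, 2) counts single words and adjacent word pairs;
# try (1, 3), (2, 2), etc. and compare results on held-out data.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(alpha=1.0),  # alpha is the smoothing parameter to tune
)
model.fit(train_texts, train_labels)

predicted = model.predict(["football team scored a goal"])[0]
print(predicted)
```

Larger n-grams capture more word order but make the feature space much sparser, so on a small dataset they may not help.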

Depending on the features of your dataset, no single classifier is guaranteed to be best for your scenario; you have to try different approaches and see which fits best.

The simplest initial approach is to start with a basic classifier using scikit-learn.

  1. Treat each category as a training class and train the classifier on these classes.

  2. For any input docX, classify it with the trained model.

  3. You will get a probability for each category.

  4. Apply a threshold, for example on the difference in probability among the three highest-scoring categories; if the threshold is met, take those categories as the tags for that input document.
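
The four steps above can be sketched as follows. This is one possible reading of the threshold in step 4, not a definitive implementation: it keeps those of the top three categories whose probability is within `max_gap` of the best one. The toy documents, the `suggest_tags` helper, and the `max_gap` and `top_k` parameters are all assumptions made for illustration. A document with several tags is repeated once per tag so that each category acts as a training class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical manually tagged documents (text, list of categories).
tagged_docs = [
    ("the team won the football match", ["sports"]),
    ("the minister announced a tax reform", ["politics"]),
    ("the minister opened the new football stadium", ["sports", "politics"]),
    ("the bank raised interest rates", ["finance"]),
]

# Step 1: one training row per (document, tag) pair.
texts = [text for text, tags in tagged_docs for _ in tags]
labels = [tag for _, tags in tagged_docs for tag in tags]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

def suggest_tags(text, max_gap=0.2, top_k=3):
    """Return up to top_k categories whose probability is close to the best."""
    # Steps 2 and 3: classify and get one probability per category.
    probs = model.predict_proba([text])[0]
    ranked = sorted(zip(model.classes_, probs),
                    key=lambda pair: pair[1], reverse=True)[:top_k]
    # Step 4: keep categories within max_gap of the highest probability.
    best_prob = ranked[0][1]
    return [str(cat) for cat, prob in ranked if best_prob - prob <= max_gap]

print(suggest_tags("the minister visited the football team"))
```

The value of `max_gap` has to be tuned on held-out tagged documents; too small and every document gets a single tag, too large and every document gets three.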

Upvotes: 3

John Lehmann

Reputation: 8225

What you are trying to do is called multi-way supervised text categorization (or classification). Knowing the right question to ask is half the problem.

As for how this can be done, here are two references:

Upvotes: 4
