cherpa123

Reputation: 1

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.

I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.

Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.

Upvotes: 0

Views: 2366

Answers (5)

cherpa123

Reputation: 1

AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.

Classifier4J appears to be the best solution for our needs, because the model looks easy to train and it doesn't require training on non-matches. http://classifier4j.sourceforge.net/usage.html

Upvotes: 0

arjun

Reputation: 1614

Try this NBayes implementation.
For identifying the "Other" category, don't worry too much. Just train on your required categories with data that clearly identifies them, and introduce a threshold in the classifier.
If no label's score crosses the threshold, the classifier assigns the "Other" label.
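The threshold idea could be sketched like this with scikit-learn's naive Bayes (the library choice, toy training data, and the 0.7 threshold are all assumptions for illustration, not from the answer):

```python
# Train only on the known labels, then add a probability threshold at
# prediction time; anything below it falls back to "other".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data -- real training sets would be much larger.
train_texts = ["stocks bonds market earnings", "cpu gpu software cloud"]
train_labels = ["finance", "tech"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

THRESHOLD = 0.7  # tune on held-out data

def classify(text):
    probs = clf.predict_proba(vec.transform([text]))[0]
    best = probs.argmax()
    return clf.classes_[best] if probs[best] >= THRESHOLD else "other"
```

Text containing none of the trained vocabulary gives a near-uniform probability vector, so no class crosses the threshold and it lands in "other".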

It's all in the training data.

Upvotes: 0

Bastian

Reputation: 1593

What about creating histograms? You could use a bag-of-words approach with significant indicators for, e.g., Tech and Finance. You could try to identify such indicators by analyzing the tags and articles of relevant websites, or just browse the web for such indicators:

http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html

Let's say your input vector X has n dimensions, where n represents the number of indicators. For example, Xi would then hold the count of occurrences of the word "asset", and Xi+k the count of the word "big data", in the current article.
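The indicator-count vector could be sketched as follows (the term lists and the minimum score are made up for illustration; scoring as the fraction of tokens that are indicators is one simple choice, not something the answer prescribes):

```python
# Predefined indicator terms per label; each label is scored by the
# fraction of the article's tokens that appear in its indicator set.
INDICATORS = {
    "finance": {"asset", "market", "stocks", "earnings"},
    "tech": {"software", "cloud", "gpu", "cpu"},
}

def label_scores(text, min_score=0.05):
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    scores = {label: sum(1 for t in tokens if t in terms) / total
              for label, terms in INDICATORS.items()}
    # Multi-label decision: every label above the cutoff matches;
    # if none does, the text falls into the catch-all category.
    matched = [label for label, s in scores.items() if s >= min_score]
    return matched or ["other"]
```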

Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.

If you must match zero or more categories, train a model which returns probability scores per label/class (such as a neural net, as Luis Leal suggested). You could then rank your output by that score and say that every class with a score higher than some threshold t is a matching category.

Upvotes: 0

Luis Leal

Reputation: 3514

Another idea is to use a neural network with a softmax output layer. Softmax gives you a probability for every class: when the network is very confident about a class, it assigns that class a high probability and the other classes low probabilities; when it is unsure, the differences between the probabilities are small and none of them is very high. You could then define a threshold like: if the probability for every class is less than 70%, predict "other".
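The decision rule could look like this (the network itself is omitted; the sketch just assumes it produces a vector of logits, and the 70% threshold is the one the answer suggests):

```python
# Apply softmax to the network's raw logits, then fall back to "other"
# when no class reaches the confidence threshold.
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # shift for numerical stability
    return z / z.sum()

def decide(logits, labels, threshold=0.70):
    probs = softmax(np.asarray(logits, dtype=float))
    best = int(np.argmax(probs))
    return labels[best] if probs[best] >= threshold else "other"
```

A confident network produces one dominant logit and a clear label; near-uniform logits produce no probability above 70% and map to "other".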

Upvotes: 2

Prune

Reputation: 77857

Whew! Classic ML algorithms don't combine multi-class classification and "in/out" detection at the same time. Perhaps what you could do would be to train five models, one for each class, with one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claims it, it's "other".

Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
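The second, reversed pipeline could be sketched with scikit-learn (an assumption, since the answer names no library; the toy data is also made up): stage 1 is a binary in-domain/other classifier, stage 2 a multi-class SVM over the known labels only.

```python
# Two-stage pipeline: a binary filter catches "other" first, then a
# multi-class SVM labels whatever survives the filter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

in_domain = ["stocks market earnings", "cloud software gpu"]
in_labels = ["finance", "tech"]
other = ["recipe for pancakes", "travel tips for rome"]

# Stage 1: in-domain (1) vs. other (0)
stage1 = make_pipeline(TfidfVectorizer(), LogisticRegression())
stage1.fit(in_domain + other, [1, 1, 0, 0])

# Stage 2: multi-class over the known labels only
stage2 = make_pipeline(TfidfVectorizer(), LinearSVC())
stage2.fit(in_domain, in_labels)

def classify(text):
    if stage1.predict([text])[0] == 0:
        return "other"
    return stage2.predict([text])[0]
```

Note that stage 1 needs a sample of "other" text to train on, which is exactly what the question says is hard to collect; the one-against-the-world variant above avoids that at the cost of five models.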

Upvotes: 1
