Reputation: 9480
I am learning Machine Learning, and so far I have used logistic regression on problems with balanced data, like sentiment analysis, where I had an equal number of training examples for both classes (+ve, -ve).
Now, I am working on a Named Entity Recognition problem where I have to identify names of people in text. Here my data is sparse: less than 10% of my training data is the +ve case (actually a person tag), and the rest is the -ve case (not a person tag). So there is a massive imbalance in my training data.
Will a supervised learning algorithm work in this scenario?
Upvotes: 2
Views: 81
Reputation: 322
It all depends on your results.
Consider the following extreme scenario: you run your model and observe that your error is fairly low, say 5%. But if only 10% of your data is +ve, that 5% error could mean you wrongly classified HALF of your +ve data (failed to recognize people's names), and suddenly 5% error looks much worse.
One thing you should do is calculate your precision and recall.
Precision:
Out of all the words that we predicted are people's names, what fraction of them actually are?
Precision = # true positives / (# true positives + # false positives)
High precision (close to 1) means you have few false positives. In other words, your model is pulling a Dos Equis: "I don't always predict positive, but when I do, I get it right."
However, high precision does not tell you whether you classified ALL actual positive examples correctly.
Recall:
Of all the examples that are actually people's names, what fraction of those did we predict correctly?
Recall = # true positives / (# true positives + # false negatives)
High recall (close to 1) means you have few false negatives: the model catches most of the examples that actually are people's names.
However, high recall doesn't tell you whether your model misclassified some negative examples as positives (false positives).
Precision and recall are often a tradeoff (you can have lots of one at the expense of the other). If so, look into the F1 score, the harmonic mean of precision and recall, which tells you whether your model has a good balance of both.
A good F1 score is close to 1. This should tell you more about how well your model is classifying words as people's names.
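For concreteness, here is a minimal sketch of how those three numbers fall out of the raw counts; the toy labels below are invented purely for illustration (scikit-learn's precision_score, recall_score, and f1_score compute the same quantities):

```python
# Toy labels, made up for illustration: 1 = person tag (+ve), 0 = not (-ve).
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```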
Upvotes: 1
Reputation: 77837
Yes; it will work fine, so long as you have enough data on each side to properly define the class. The amount you need depends on the classification method you use. In fact, I have a couple of SVM models that work very nicely, trained with nothing but +ve data -- no -ve data at all!
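(Training on +ve data alone is essentially one-class classification; a minimal sketch of that idea using scikit-learn's OneClassSVM is below. The random feature matrix is a stand-in assumption, not the answerer's actual models.)

```python
# Sketch of the "+ve data only" idea with a one-class SVM;
# the random features are illustrative stand-ins, not real NER data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, scale=0.3, size=(200, 5))  # features of +ve examples only

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(X_pos)  # trained with no -ve data at all

X_new = rng.normal(loc=0.0, scale=0.3, size=(5, 5))  # unseen examples far from the +ve cluster
print(model.predict(X_new))  # +1 = resembles the +ve class, -1 = does not
```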
For most methods, the lopsided input suggests that you could toss out the 80% of your -ve cases that aren't doing as much to define the boundary. Which 80% will vary with the method. For instance, spectral clustering and k-means will work well enough if you remove 80% evenly spaced (removing at random is likely to work too). A linear SVM works if you keep only the 10% nearest the boundary. Naive Bayes and random forests can also work nicely with a random 80% removal, although any method that works by successive refinement may converge a little more slowly.
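A minimal sketch of that random 80% removal, assuming you undersample the -ve (majority) class with NumPy; the data shapes and the 20% keep-rate are illustrative assumptions:

```python
# Randomly discard ~80% of the -ve (majority) class before training;
# the feature matrix, labels, and keep-rate are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # stand-in feature matrix
y = (rng.random(1000) < 0.1).astype(int)  # ~10% +ve labels, as in the question

pos = np.flatnonzero(y == 1)              # keep every +ve example
neg = np.flatnonzero(y == 0)
neg_kept = rng.choice(neg, size=len(neg) // 5, replace=False)  # keep 20% of -ve

keep = np.concatenate([pos, neg_kept])
rng.shuffle(keep)                         # mix classes before training
X_bal, y_bal = X[keep], y[keep]           # roughly 1:2 instead of 1:9
print(f"{len(pos)} +ve vs {len(neg_kept)} -ve after undersampling")
```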
Upvotes: 1