Rod0n

Reputation: 1079

Classification training with only positive sentences

I'm starting a project to build an automated fact-checking classifier and I have some doubts about the process to follow.

I have a database of ~1000 sentences, each one a positive fact-check example. To build a supervised machine learning model I would need a large set of sentences tagged true/false depending on whether each is a fact-check candidate or not. That would require a lot of time and effort, so I'd like to first get results (with less accuracy, I guess) without doing that.

My idea is to take the already-tagged positive sentences and apply a PoS tagger to them. This would give me interesting information for spotting patterns, such as the most common words (e.g. raised, increase, won) and the most common PoS tags (e.g. verbs in past/present tense, dates and numerals).
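For illustration, a minimal sketch of that pattern-counting step might look like the following. It assumes English text and NLTK with its tokenizer and tagger models already downloaded via nltk.download(); the two sentences are just stand-ins for my database.

```python
# Minimal pattern-counting sketch over the positive sentences. Assumes NLTK's
# tokenizer and PoS tagger models have been fetched with nltk.download().
from collections import Counter

import nltk

# Stand-ins for the ~1,000 fact-check-positive sentences.
positives = [
    "There are one million more people in poverty than five years ago",
    "Unemployment increased by 2% last year",
]

word_counts, tag_counts = Counter(), Counter()
for sentence in positives:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        word_counts[word.lower()] += 1
        tag_counts[tag] += 1  # e.g. VBD = past-tense verb, CD = numeral

print(word_counts.most_common(10))
print(tag_counts.most_common(10))
```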

With these results I'm thinking about assigning weights in order to score new, unclassified sentences. The problem is that the weight assignment would be done by me in a "heuristic" way. It would be better to use the PoS tagger's output to train some model that assigns probabilities in a more principled way.
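As a toy version of what I mean by heuristic weights: the tag weights below are numbers I invented by hand, which is exactly the kind of guesswork I'd like a trained model to replace.

```python
# Toy hand-assigned weights over Penn Treebank tags; these values are ad-hoc
# guesses, exactly the kind of thing a trained model should learn instead.
import nltk

WEIGHTS = {
    "CD": 2.0,   # numerals
    "VBD": 1.5,  # past-tense verb ("raised", "won")
    "VBZ": 1.0,  # present-tense verb, 3rd person singular
    "JJR": 1.0,  # comparative ("more", "higher")
    "MD": -2.0,  # modal ("will"): suggests a prediction, not a checkable fact
}

def heuristic_score(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return sum(WEIGHTS.get(tag, 0.0) for _, tag in tagged)

print(heuristic_score("There are one million more people in poverty than five years ago"))
print(heuristic_score("We will increase the GDP by 3% the following year"))
```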

Could you give me some pointers on how to accomplish this?

I've read about Maximum Entropy classifiers and statistical parsers, but I really don't know if they're the right choice.

Edit (I think it'd be better to give more details):

Parsing the sentences with a PoS tagger gives me some useful information about each one, allowing me to filter and weight them using some custom metrics.

For example:

There are one million more people in poverty than five years ago -> indicators of a fact-check candidate sentence: verb in present tense, numerals and dates, a (than) comparison.

We will increase the GDP by 3% the following year -> indicators of a NOT fact-check candidate sentence: it's in the future tense (suggesting some sort of prediction).
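Those indicators could be encoded as features roughly like this; the tag sets and the "will" test are my assumptions about how the PoS tagger surfaces these cues.

```python
# Rough feature encoding of the indicators above; tag sets and the "will"
# check are assumptions about how the tagger exposes these cues.
import nltk

def extract_features(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tags = {tag for _, tag in tagged}
    words = {word.lower() for word, _ in tagged}
    return {
        "present_verb": bool(tags & {"VBP", "VBZ"}),
        "numeral": "CD" in tags,
        "comparison": "than" in words or bool(tags & {"JJR", "RBR"}),
        "future_modal": "will" in words,  # crude stand-in for future tense
    }

print(extract_features("There are one million more people in poverty than five years ago"))
print(extract_features("We will increase the GDP by 3% the following year"))
```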

Upvotes: 0

Views: 318

Answers (1)

Breck Baldwin

Reputation: 171

This situation happens often when the true sentences are relatively rare in the data.

1) Get a corpus of sentences that resembles what you will be classifying in the end. The corpus will contain both true and false sentences. Label them all as false, i.e. non-fact-check; we are assuming they are all false even though we know this is not the case. You want the ratio of true to false data to approximate the actual distribution if at all possible. So if 10% are true in real data, then your assumed-false cases are 90%, or 9,000 for your 1,000 trues. If you don't know the distribution, just make it 10x or more. (A rough Python sketch of steps 1-3 and 5 follows this list.)

2) Train a logistic regression classifier (a.k.a. maximum entropy) on the data with cross-validation. Keep track of the high-scoring false positives on the held-out data.

3) Re-annotate the false positives down to whatever score makes sense as a cutoff for possibly being true positives. This will hopefully clean your assumed-false data.

4) Keep running this process until you are no longer improving the classifier.

5) To get your "fact check words", make sure your feature extractor is feeding words to your classifier, then look for those that are positively associated with the true category; any decent logistic regression classifier should expose the feature weights in some way. I use LingPipe, which certainly does.

6) I don't see how PoS (Part of Speech) helps with this problem.
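Since LingPipe is a Java library, here is only an assumed analogue of how steps 1-3 and 5 could fit together in Python with scikit-learn; the placeholder data, bag-of-words features, and 0.5 cutoff are all illustrative choices, not the canonical recipe.

```python
# Assumed scikit-learn analogue of steps 1-3 and 5. The answer itself uses
# LingPipe, so treat the names, threshold, and stand-in data as illustrative.
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Step 1: stand-in data; substitute your 1,000 positives and a ~10x corpus
# of similar sentences that you label false even though some are not.
positive_sentences = ["There are one million more people in poverty than five years ago"] * 10
unlabeled_corpus = ["We will increase the GDP by 3% the following year"] * 100

random.seed(0)
n_neg = min(len(unlabeled_corpus), 10 * len(positive_sentences))
assumed_negatives = random.sample(unlabeled_corpus, n_neg)

texts = positive_sentences + assumed_negatives
labels = [1] * len(positive_sentences) + [0] * n_neg  # 0 = assumed false

# Step 2: logistic regression (maximum entropy) over bag-of-words features,
# scored with out-of-fold probabilities from cross-validation.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
probs = cross_val_predict(pipeline, texts, labels, cv=5,
                          method="predict_proba")[:, 1]

# Step 3: assumed-false sentences with high scores are the re-annotation
# candidates; 0.5 is an arbitrary cutoff to adjust to whatever makes sense.
candidates = sorted(
    ((p, t) for p, t, y in zip(probs, texts, labels) if y == 0 and p > 0.5),
    reverse=True,
)
for p, t in candidates[:50]:
    print(f"{p:.2f}  {t}")

# Step 5: refit on everything and read off the words most positively
# associated with the true category via the model's feature weights.
pipeline.fit(texts, labels)
vec = pipeline.named_steps["countvectorizer"]
clf = pipeline.named_steps["logisticregression"]
top_words = sorted(zip(clf.coef_[0], vec.get_feature_names_out()), reverse=True)
print(top_words[:20])
```

The point of cross_val_predict here is that every sentence is scored by a model that never saw it during training, so the high-scoring "false" sentences are genuine re-annotation candidates rather than memorized examples.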

This approach will fail to find true instances that are very different from the training data, but it can work nonetheless.

Breck

Upvotes: 4
