Reputation: 199
I am a beginner at NLTK and machine learning with the goal of giving uncertainty ratings to sentences.
For example, a sentence like "This is likely caused by a.." would receive a certainty score of, say, 6, whereas "There is definitely something wrong with me" would receive a 10 and "I think it could possibly happen" would score a 3.
Regardless of the scoring system, a simple classification into "certain" and "uncertain" would also suffice for my needs.
I did not find any existing work on this. How should I approach it? I do have some unlabeled text data.
Upvotes: 4
Views: 1670
Reputation: 11190
As far as I know, existing NLP toolkits do not have such a feature.
You have to train your own model and for that you need training data. If you have a dataset that contains uncertainty labels for each sentence, then you can train a text classification model on that.
If you don't have labeled data, there was a CoNLL 2010 shared task on detecting uncertainty/hedging, and its dataset should be available. You can train a simple text classifier on the CoNLL 2010 dataset and then apply the trained model to your own data. Assuming the nature of your data is not very different from theirs, this should work.
For text classification, you can simply use the scikit-learn library, which is straightforward.
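As a rough illustration of that approach, here is a minimal sketch of a scikit-learn text classifier. The tiny training set is made up for the example (it is not the CoNLL 2010 data), the label convention 1 = uncertain / 0 = certain is an assumption, and `uncertainty_score` is a hypothetical helper. With real labeled sentences you would swap in your own lists and evaluate on held-out data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-written training set (hypothetical; replace with real labeled
# sentences, e.g. from the CoNLL 2010 shared task).
# Assumed label convention: 1 = uncertain, 0 = certain.
train_sentences = [
    "This is likely caused by a virus",
    "I think it could possibly happen",
    "It may be related to the update",
    "Perhaps the results suggest a link",
    "There is definitely something wrong with me",
    "The test clearly failed",
    "This is certainly a bug",
    "The server is down",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Word unigrams and bigrams weighted by TF-IDF, fed to a linear classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_sentences, train_labels)


def uncertainty_score(sentence):
    """Return the predicted probability (0.0 to 1.0) that a sentence is uncertain."""
    # classes_ is sorted, so column 1 of predict_proba corresponds to label 1.
    return float(model.predict_proba([sentence])[0, 1])


# Apply the trained model to new, unlabeled sentences.
new_sentences = ["It might possibly be a virus", "This is definitely a bug"]
predictions = model.predict(new_sentences)
scores = [uncertainty_score(s) for s in new_sentences]

for sentence, label, score in zip(new_sentences, predictions, scores):
    print(f"{sentence!r}: label={label}, p(uncertain)={score:.2f}")
```

If you want a graded rating rather than a binary label, the predicted probability could be rescaled to whatever range you prefer. How well that number matches human judgments of certainty depends entirely on the training data.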
You might also find the following references useful:
Rubin, Victoria et al. "Certainty identification in texts: Categorization model and manual tagging results." Computing attitude and affect in text: Theory and applications. 2006. 61-76.
Medlock, Ben, and Ted Briscoe. "Weakly supervised learning for hedge classification in scientific literature." ACL. 2007.
Upvotes: 5