Reputation: 637
I am building a tool to calculate the probability that a text review is fake (spam) rather than real.
I have an annotated dataset of reviews marked as spam or non-spam. I used an SVM to build a classifier, but it only labels an input document as spam or non-spam. What I want instead is a number between 0 and 1 representing the probability that the document is spam. Can someone please point me in the right direction?
Upvotes: 0
Views: 310
Reputation: 7394
If you want a continuous-valued score (rather than an explicit probability), you can use the SVM's signed distance from the point to the separating hyperplane. This is a standard measure of confidence, which you can read as how far "into" the class the point lies.
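As a minimal sketch of that idea, assuming a linear SVM whose weight vector and bias are already known: the values of `weights` and `bias` below are made up for illustration and do not come from any trained model. The sign of the result gives the predicted side of the hyperplane, and the magnitude gives the confidence.

```python
import math

def signed_distance(weights, bias, features):
    """Signed distance from a feature vector to the hyperplane w.x + b = 0."""
    raw_score = sum(w * x for w, x in zip(weights, features)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return raw_score / norm

# Hypothetical parameters standing in for a trained linear SVM.
weights = [3.0, -4.0]
bias = 1.0

far_inside = signed_distance(weights, bias, [3.0, 0.0])     # deep in the positive class
near_boundary = signed_distance(weights, bias, [1.0, 1.0])  # on the hyperplane
other_side = signed_distance(weights, bias, [0.0, 4.0])     # deep in the negative class

print(far_inside, near_boundary, other_side)
```

Note that this score is unbounded, so it ranks documents by confidence but is not itself a value between 0 and 1.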
If you want to use the classifications as part of a broader probabilistic model, where you need something with a genuine probability interpretation, you could use one of the methods for converting SVM scores into probabilities, but these are retrofitted onto the model and don't have great theoretical underpinnings. Instead, I'd suggest you take a look at the logistic regression classifier, sometimes known as a Maximum Entropy classifier, for a robust probabilistic alternative. It has the benefits of a discriminative model like an SVM, but its output is inherently a probability.
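To make the logistic regression suggestion concrete, here is a small self-contained sketch trained with plain gradient descent. The two features (say, a count of exclamation marks and a count of product-specific terms) and the six labelled rows are invented for illustration; in practice you would use a library implementation and real features extracted from the reviews.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability that feature vector x belongs to the spam class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up toy data: 1 = spam, 0 = non-spam.
X = [[3, 0], [2, 0], [3, 1], [0, 2], [0, 3], [1, 3]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
p_spam_like = predict_proba(w, b, [3, 0])
p_real_like = predict_proba(w, b, [0, 3])
print(p_spam_like, p_real_like)
```

The output of `predict_proba` is always between 0 and 1, which is the kind of number the question asks for.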
Upvotes: 1
Reputation: 1974
You can get probabilities from an SVM. Take a look at libsvm: its -b option enables probability estimates (set -b 1 for both training and prediction).
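SVM probability outputs of this kind are typically obtained by fitting a sigmoid to the decision scores (Platt scaling), and as far as I know libsvm's estimates follow that general approach. The sketch below is not libsvm's implementation; it only illustrates the idea, with made-up scores and labels and a plain gradient-descent fit of the two sigmoid parameters `a` and `b`.

```python
import math

def platt_probability(a, b, score):
    """Map an SVM decision score to P(spam) = 1 / (1 + exp(a*score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit the sigmoid parameters by gradient descent on the log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_probability(a, b, s)
            # Gradient of the log-loss with respect to a and b.
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Made-up held-out decision scores and true labels (1 = spam, 0 = non-spam).
scores = [-2.1, -1.3, -0.4, 0.3, -0.2, 0.9, 1.7, 2.4]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = fit_platt(scores, labels)
p_high = platt_probability(a, b, 2.0)
p_low = platt_probability(a, b, -2.0)
print(p_high, p_low)
```

The sigmoid should be fitted on scores from data the SVM was not trained on (for example via cross-validation), otherwise the resulting probabilities tend to be overconfident.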
Upvotes: 0