Himanshu Jindal

Reputation: 637

How to get probability of spam rather than classification

I am building a tool to estimate the probability that a text review is fake (spam) rather than real.

I have an annotated dataset of reviews marked as spam or nonspam. I have used an SVM to build a classifier, but it only classifies an input document as spam or nonspam. What I want instead is a tool that gives me a number between 0 and 1 representing the probability that the document is spam. Can someone please point me in the right direction?

Upvotes: 0

Views: 310

Answers (3)

Ben Allison

Reputation: 7394

If you want a continuous-valued score (rather than an explicit probability), you can just use the distance to the hyperplane from the SVM. This is a standard measure of confidence, which you can see as how far "into" the class the point is.
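As a rough illustration of that score, here is a minimal sketch for a linear SVM. The weight vector, bias, and feature vectors are hypothetical values made up for the example; in practice they would come from your trained model and your feature extraction.

```python
import math

def signed_distance(weights, bias, features):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return score / norm

# Hypothetical parameters of an already-trained linear SVM (assumption).
weights = [3.0, -4.0]
bias = 1.0

# Hypothetical feature vectors for two reviews (assumption).
reviews = {"review_a": [2.0, 0.5], "review_b": [0.1, 0.4]}

# Positive values fall on the "spam" side, negative on the "nonspam" side;
# a larger magnitude means the point lies further from the boundary.
distances = {name: signed_distance(weights, bias, f) for name, f in reviews.items()}
```

Note that this number is unbounded and is not a probability; it only ranks documents by how confidently they fall on one side.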

If you want to actually use the classifications as part of a broader probabilistic model, where you need something with a genuine probability interpretation, you could use one of the methods for converting SVM scores into probabilities, but these are something of a retrofit and don't have great theoretical underpinnings. Instead, I'd suggest you take a look at the logistic regression classifier, sometimes known as Maximum Entropy, for a robust probabilistic alternative. This has the benefits of a discriminative model like SVM but with a natural and inherent probabilistic underpinning.
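To make the logistic regression suggestion concrete, here is a minimal sketch trained with plain batch gradient descent. The two-dimensional features, the toy data, the learning rate, and the function names are all assumptions for illustration; a real system would use proper text features and a library implementation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def spam_probability(w, b, x):
    """Model's estimate of P(spam | x), a number between 0 and 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy, made-up feature vectors (assumption); label 1 = spam, 0 = nonspam.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic_regression(X, y)
p_spam = spam_probability(w, b, [0.85, 0.9])
p_real = spam_probability(w, b, [0.1, 0.1])
```

The model output is a probability by construction, so no separate conversion step is needed.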

Upvotes: 1

snøreven

Reputation: 1974

You can get probability estimates from an SVM. Take a look at libsvm: pass -b 1 to both svm-train and svm-predict to enable them.
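Probability estimates of this kind are obtained by fitting a sigmoid to the SVM's decision values, in the spirit of Platt scaling. The following is a simplified sketch of that idea, not libsvm's actual implementation: the scores and labels are made-up held-out data, and the fit uses plain gradient descent without the target smoothing or cross-validation a real implementation would apply.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_sigmoid(scores, labels, lr=0.1, epochs=5000):
    """Fit P(spam | score) = sigmoid(a * score + b) by minimizing log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical SVM decision values on held-out reviews (assumption);
# label 1 = spam, 0 = nonspam. The classes overlap slightly on purpose.
scores = [2.1, 1.3, 0.4, -0.2, 0.3, -0.5, -1.2, -2.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

a, b = fit_sigmoid(scores, labels)

def score_to_probability(score):
    """Map a raw SVM decision value to a number between 0 and 1."""
    return sigmoid(a * score + b)

p_high = score_to_probability(2.0)
p_low = score_to_probability(-2.0)
```

The sigmoid should be fitted on data the SVM was not trained on; otherwise the resulting probabilities tend to be overconfident.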

Upvotes: 0

hd1

Reputation: 34677

Instead of writing your own, why not plug into Akismet? Spam detection is Bayesian and performs better the more data you give it.

Upvotes: 0
