jonwithers

Reputation: 17

Interpreting rawPrediction from Spark ML LinearSVC

I am using Spark ML's LinearSVC in a binary classification model. The transform method creates two columns, prediction and rawPrediction. Spark's docs don't provide any way of interpreting the rawPrediction column for this particular classifier. This question has been asked and answered for other classifiers, but not specifically for LinearSVC.

The relevant column from my predictions dataframe:

+------------------------------------------+ 
|rawPrediction                             | 
+------------------------------------------+ 
|[0.8553257800650063,-0.8553257800650063]  | 
|[0.4230977574196645,-0.4230977574196645]  | 
|[0.49814263303537865,-0.49814263303537865]| 
|[0.9506355050332026,-0.9506355050332026]  | 
|[0.5826887000450813,-0.5826887000450813]  | 
|[1.057222808292026,-1.057222808292026]    | 
|[0.5744214192446275,-0.5744214192446275]  | 
|[0.8738081933835614,-0.8738081933835614]  | 
|[1.418173816502859,-1.418173816502859]    | 
|[1.0854125533426737,-1.0854125533426737]  | 
+------------------------------------------+

Clearly this isn't simply the probability of belonging to each class. What is it?

Edit: Since the input code has been requested, here's a model built on a subset of features in the original dataset. Fitting any data with Spark's LinearSVC will produce this column.

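import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.feature.VectorAssembler

var df = sqlContext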
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/full_frame_20180716.csv")


var assembler = new VectorAssembler()
  .setInputCols(Array("oy_length", "ah_length", "ey_length", "vay_length", "oh_length", 
                      "longest_word_length", "total_words", "repeated_exact_words",
                      "repeated_bigrams", "repeated_lemmatized_words", 
                      "repeated_lemma_bigrams"))
  .setOutputCol("features")

df = assembler.transform(df)

var Array(train, test) = df.randomSplit(Array(.8,.2), 42)

var supvec = new LinearSVC()
  .setLabelCol("written_before_2004")
  .setMaxIter(10)
  .setRegParam(0.001)

var supvecModel = supvec.fit(train)

var predictions = supvecModel.transform(test)

predictions.select("rawPrediction").show(20, false)

Output:

+----------------------------------------+
|rawPrediction                           |
+----------------------------------------+
|[1.1502868455791242,-1.1502868455791242]|
|[0.853488887006264,-0.853488887006264]  |
|[0.8064994501574174,-0.8064994501574174]|
|[0.7919862003563363,-0.7919862003563363]|
|[0.847418035176922,-0.847418035176922]  |
|[0.9157433788236442,-0.9157433788236442]|
|[1.6290888181913814,-1.6290888181913814]|
|[0.9402461917731906,-0.9402461917731906]|
|[0.9744052798627367,-0.9744052798627367]|
|[0.787542624053347,-0.787542624053347]  |
|[0.8750602657901001,-0.8750602657901001]|
|[0.7949414037722276,-0.7949414037722276]|
|[0.9163545832998052,-0.9163545832998052]|
|[0.9875454213431247,-0.9875454213431247]|
|[0.9193015302646135,-0.9193015302646135]|
|[0.9828623328048487,-0.9828623328048487]|
|[0.9175976004208621,-0.9175976004208621]|
|[0.9608750388820302,-0.9608750388820302]|
|[1.029326217566756,-1.029326217566756]  |
|[1.0190290910146256,-1.0190290910146256]|
+----------------------------------------+
only showing top 20 rows

Upvotes: 0

Views: 774

Answers (2)

mat

Reputation: 1

As arpad mentions, it is the margin.

The margin is the dot product of the coefficients and the feature vector, plus the intercept:

      margin = coefficients · features + intercept
                            or
                     y = w · x + b

If you divide the margin by the norm of the coefficients, you get the signed distance from each data point to the hyperplane.
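To make the formula concrete, here is a minimal plain-Scala sketch with no Spark dependency. The object name `MarginSketch`, its helper names, and the numbers in the usage note are made up for illustration; on a fitted model you would take `w` and `b` from `supvecModel.coefficients` and `supvecModel.intercept` instead.

```scala
// Illustrative sketch only: plain Scala arrays stand in for Spark vectors.
object MarginSketch {
  // margin = w · x + b
  def margin(w: Array[Double], x: Array[Double], b: Double): Double = {
    require(w.length == x.length, "weights and features must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
  }

  // Same layout as the rawPrediction column: (-margin, margin)
  def rawPrediction(m: Double): Array[Double] = Array(-m, m)

  // Signed distance to the hyperplane: margin divided by the L2 norm of w
  def signedDistance(m: Double, w: Array[Double]): Double =
    m / math.sqrt(w.map(wi => wi * wi).sum)
}
```

With the made-up values w = (3, 4), x = (1, 1) and b = -8, the margin is -1, the raw prediction is (1, -1), and the signed distance is -0.2. The sign tells you which side of the hyperplane the point is on.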

Upvotes: 0

arpad

Reputation: 422

It is (-margin, margin), as the model's predictRaw method shows:

override protected def predictRaw(features: Vector): Vector = {
  val m = margin(features)
  Vectors.dense(-m, m)
}
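As a rough sketch of how the predicted label relates to this vector: as far as I know, LinearSVC predicts class 1 when the margin (the second entry) exceeds the `threshold` param, which defaults to 0.0. Treat that rule as an assumption and check it against your Spark version. The object and function names below are hypothetical and the code is plain Scala, not the Spark implementation.

```scala
// Illustrative sketch only: maps a (-margin, margin) pair to a label.
object RawToLabelSketch {
  // Assumes the label is 1.0 when margin > threshold, otherwise 0.0
  def predictedLabel(raw: Array[Double], threshold: Double = 0.0): Double = {
    require(raw.length == 2, "expected a (-margin, margin) pair")
    if (raw(1) > threshold) 1.0 else 0.0
  }
}
```

Under that assumption, every row shown in the question has a negative second entry, so each of those rows would be labelled 0.0.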

Upvotes: 1
