Reputation: 17
I am using Spark ML's LinearSVC in a binary classification model. The transform
method creates two columns, prediction
and rawPrediction
. Spark's docs don't provide any way of interpreting the rawPrediction
column for this particular classifier. This question has been asked and answered for other classifiers, but not specifically for LinearSVC.
The relevant column from my predictions
dataframe:
+------------------------------------------+
|rawPrediction |
+------------------------------------------+
|[0.8553257800650063,-0.8553257800650063] |
|[0.4230977574196645,-0.4230977574196645] |
|[0.49814263303537865,-0.49814263303537865]|
|[0.9506355050332026,-0.9506355050332026] |
|[0.5826887000450813,-0.5826887000450813] |
|[1.057222808292026,-1.057222808292026] |
|[0.5744214192446275,-0.5744214192446275] |
|[0.8738081933835614,-0.8738081933835614] |
|[1.418173816502859,-1.418173816502859] |
|[1.0854125533426737,-1.0854125533426737] |
+------------------------------------------+
Clearly this isn't simply the probability of belonging to each class. What is it?
Edit: Since the input code has been requested, here's a model built on a subset of features in the original dataset. Fitting any data with Spark's LinearSVC will produce this column.
var df = sqlContext
.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/full_frame_20180716.csv")
var assembler = new VectorAssembler()
.setInputCols(Array("oy_length", "ah_length", "ey_length", "vay_length", "oh_length",
"longest_word_length", "total_words", "repeated_exact_words",
"repeated_bigrams", "repeated_lemmatized_words",
"repeated_lemma_bigrams"))
.setOutputCol("features")
df = assembler.transform(df)
var Array(train, test) = df.randomSplit(Array(.8,.2), 42)
var supvec = new LinearSVC()
.setLabelCol("written_before_2004")
.setMaxIter(10)
.setRegParam(0.001)
var supvecModel = supvec.fit(train)
var predictions = supvecModel.transform(test)
predictions.select("rawPrediction").show(20, false)
Output:
+----------------------------------------+
|rawPrediction |
+----------------------------------------+
|[1.1502868455791242,-1.1502868455791242]|
|[0.853488887006264,-0.853488887006264] |
|[0.8064994501574174,-0.8064994501574174]|
|[0.7919862003563363,-0.7919862003563363]|
|[0.847418035176922,-0.847418035176922] |
|[0.9157433788236442,-0.9157433788236442]|
|[1.6290888181913814,-1.6290888181913814]|
|[0.9402461917731906,-0.9402461917731906]|
|[0.9744052798627367,-0.9744052798627367]|
|[0.787542624053347,-0.787542624053347] |
|[0.8750602657901001,-0.8750602657901001]|
|[0.7949414037722276,-0.7949414037722276]|
|[0.9163545832998052,-0.9163545832998052]|
|[0.9875454213431247,-0.9875454213431247]|
|[0.9193015302646135,-0.9193015302646135]|
|[0.9828623328048487,-0.9828623328048487]|
|[0.9175976004208621,-0.9175976004208621]|
|[0.9608750388820302,-0.9608750388820302]|
|[1.029326217566756,-1.029326217566756] |
|[1.0190290910146256,-1.0190290910146256]| +----------------------------------------+
only showing top 20 rows
Upvotes: 0
Views: 774
Reputation: 1
As it is mention by arpad, it is the margin.
And the margin is:
margin = coefficients * feature + intercept
or
y = w * x + b
If you divide the margin by the norm of the coefficients, you will get the distance to the hyperplane for each data point.
Upvotes: 0
Reputation: 422
It is (-margin, margin)
.
override protected def predictRaw(features: Vector): Vector = {
val m = margin(features)
Vectors.dense(-m, m)
}
Upvotes: 1