Reputation: 65
I am working on Apache Spark to build the LRM using the LogisticRegressionWithLBFGS() class provided by MLib. Once the Model is built, we can use the predict function provided which gives only the binary labels as the output. I also want the probabilities to be calculated for the same.
There is an implementation for the same found in
override protected def predictPoint(
dataMatrix: Vector,
weightMatrix: Vector,
intercept: Double) = {
require(dataMatrix.size == numFeatures)
// If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
if (numClasses == 2) {
val margin = dot(weightMatrix, dataMatrix) + intercept
val score = 1.0 / (1.0 + math.exp(-margin))
threshold match {
case Some(t) => if (score > t) 1.0 else 0.0
case None => score
}
}
This method is not exposed, and also the probabilities are not available. Can I know how to use this function to get probabilities. The dot method which is used in the above function is also not exposed, it is present in the BLAS Package but it is not public.
Upvotes: 1
Views: 7107
Reputation: 7326
I encountered a similar problem in trying to obtain the raw predictions for a multiples problem. For me, the best solution was to create a method by borrowing and customizing from the Spark MLlib Logistic Regression src. You can create a like so:
object ClassificationUtility {
def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
(Double, Array[Double]) = {
require(dataMatrix.size == model.numFeatures)
val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
val weightsArray: Array[Double] = model.weights match {
case dv: DenseVector => dv.values
case _ =>
throw new IllegalArgumentException(
s"weights only supports dense vector but got type ${model.weights.getClass}.")
}
var bestClass = 0
var maxMargin = 0.0
val withBias = dataMatrix.size + 1 == dataWithBiasSize
val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
(0 until model.numClasses - 1).foreach { i =>
var margin = 0.0
dataMatrix.foreachActive { (index, value) =>
if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
}
// Intercept is required to be added into margin.
if (withBias) {
margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
}
if (margin > maxMargin) {
maxMargin = margin
bestClass = i + 1
}
classProbabilities(i+1) = 1.0 / (1.0 + Math.exp(-(margin - maxMargin)))
}
return (bestClass.toDouble, classProbabilities)
}
}
Note it is only slightly different from the original method, it just calculates the logistic as a function of the input features. It also defines some vals and vars that are originally private and included outside of this method. Ultimately, it indexes the scores in an Array and returns it along with the best answer. I call my method like so:
// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
.map { case LabeledPoint(label, features) =>
val (prediction, probabilities) = ClassificationUtility
.predictPoint(features, model)
(prediction, label, probabilities)}
However:
It seems the Spark contributors are discouraging the use of MLlib
in favor of ML
. The ML logistic regression API currently does not support multiples classification. I am now using OneVsRest which acts as a wrapper for one vs all classification. I am working on a similar customization to get the raw scores.
Upvotes: 3
Reputation: 23
I believe the call is myModel.clearThreshold()
; i.e. myModel.clearThreshold
without the parentheses fails. See the linear SVM example here.
Upvotes: 0
Reputation: 1837
Call myModel.clearThreshold
to get the raw prediction instead of the 0/1 labels.
Mind this only works for Binary Logistic Regression (numClasses == 2).
Upvotes: 5