Reputation: 65

Predicting Probabilities in Logistic Regression Model in Apache Spark MLlib

I am using Apache Spark to build a logistic regression model with the LogisticRegressionWithLBFGS() class provided by MLlib. Once the model is built, the provided predict function returns only the binary labels. I also want the predicted probabilities.

An implementation of this calculation can be found in

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

override protected def predictPoint(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double) = {
  require(dataMatrix.size == numFeatures)

  // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
  if (numClasses == 2) {
    val margin = dot(weightMatrix, dataMatrix) + intercept
    val score = 1.0 / (1.0 + math.exp(-margin))
    threshold match {
      case Some(t) => if (score > t) 1.0 else 0.0
      case None => score
    }
  }

This method is not exposed, so the probabilities are not available. How can I use this function to get the probabilities? The dot method used in the function above is not exposed either; it lives in the BLAS package, but it is not public.

Upvotes: 1

Answers (3)

Brian

Reputation: 7326

I encountered a similar problem when trying to obtain the raw predictions for a multiclass problem. For me, the best solution was to create a method by borrowing and customizing code from the Spark MLlib logistic regression source. You can create one like so:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{DenseVector, Vector}

object ClassificationUtility {
  // Returns the predicted class together with one score per class.
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
      (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    // Length of one class's block of weights (features plus optional intercept).
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    // Entry 0 (the pivot class) is never assigned below and stays 0.0.
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      // Logistic of the margin relative to the running maximum.
      // These per-class scores are not normalized to sum to 1.
      classProbabilities(i + 1) = 1.0 / (1.0 + Math.exp(-(margin - maxMargin)))
    }
    (bestClass.toDouble, classProbabilities)
  }
}

Note that it is only slightly different from the original method: it just calculates the logistic as a function of the input features. It also defines some vals and vars that are private in the original and declared outside of this method. Finally, it stores the scores in an Array and returns it along with the best class. I call my method like so:

// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test.map {
  case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility.predictPoint(features, model)
    (prediction, label, probabilities)
}
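
The array returned by predictPoint holds per-class logistic scores rather than probabilities that sum to 1. If normalized probabilities are needed, one option is a softmax over the per-class margins, assuming (as in MLlib's multinomial model) that class 0 is the pivot with an implicit margin of 0. This is a minimal sketch in plain Scala, not the answer author's method; SoftmaxFromMargins and probabilities are hypothetical names and are not part of Spark:

```scala
object SoftmaxFromMargins {
  /**
   * margins(k - 1) is the margin of class k relative to the pivot class 0.
   * Returns one probability per class (index 0 is the pivot); they sum to 1.
   */
  def probabilities(margins: Array[Double]): Array[Double] = {
    // The pivot class has an implicit margin of 0.
    val all = 0.0 +: margins
    // Subtract the maximum before exponentiating for numerical stability.
    val maxMargin = all.max
    val exps = all.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }
}
```

To use it with the method above, the margins would have to be collected into an array inside the loop (one entry per non-pivot class) and passed to SoftmaxFromMargins.probabilities.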

However:

It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. I am working on a similar customization to get the raw scores.

Upvotes: 3

wellplacedadjective

Reputation: 23

I believe the call is myModel.clearThreshold(), with the parentheses; myModel.clearThreshold without them fails. See the linear SVM example here.

Upvotes: 0

selvinsource

Reputation: 1837

Call myModel.clearThreshold to get the raw prediction instead of the 0/1 labels.

Mind this only works for Binary Logistic Regression (numClasses == 2).
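
A minimal sketch of what this looks like for the binary case, assuming spark-mllib is on the classpath. The model here is built by hand with made-up weights purely for illustration (in practice it would come from LogisticRegressionWithLBFGS), and ClearThresholdExample and run are hypothetical names. The last step recomputes the same value from the public weights and intercept fields, which avoids the non-public dot helper:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

object ClearThresholdExample {
  // Returns (thresholded label, raw score from predict, manually computed score).
  def run(): (Double, Double, Double) = {
    // Hypothetical hand-built binary model: weights (2.0, -1.0), intercept 0.5.
    val model = new LogisticRegressionModel(Vectors.dense(2.0, -1.0), 0.5)
    val features = Vectors.dense(1.0, 0.5)

    // With the default threshold, predict returns a 0/1 label.
    val label = model.predict(features)

    // After clearing the threshold, predict returns the logistic score.
    model.clearThreshold()
    val probability = model.predict(features)

    // The same value computed by hand from the public model fields.
    val margin = model.weights.toArray
      .zip(features.toArray)
      .map { case (w, x) => w * x }
      .sum + model.intercept
    val manual = 1.0 / (1.0 + math.exp(-margin))

    (label, probability, manual)
  }
}
```

Note that clearThreshold modifies the model in place, so later calls to predict keep returning raw scores until a threshold is set again with setThreshold.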

Upvotes: 5
