Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

I am using LogisticRegressionWithLBFGS() to train a model with multiple classes.

The MLlib documentation says that clearThreshold() can be used only when the classification is binary. Is there something similar for multiclass classification, so that the model outputs the probability of each class for a given input?

Upvotes: 2

Answers (1)

Brian

Reputation: 7326

There are two ways to accomplish this. One is to write a method that takes over the responsibility of predictPoint in LogisticRegression.scala:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{DenseVector, Vector}

object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
    (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    // Index 0 (the pivot class) is never assigned below and stays 0.0.
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      // Logistic function of the margin for class i + 1.
      classProbabilities(i + 1) = 1.0 / (1.0 + Math.exp(-margin))
    }
    (bestClass.toDouble, classProbabilities)
  }
}

Note that it is only slightly different from the original method: it additionally calculates the logistic function of each class margin, which is itself a function of the input features. It also defines some vals and vars that are private in the original and declared outside of the method. Finally, it stores the scores in an Array and returns it along with the best class. I call my method like so:

// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
  .map { case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility
      .predictPoint(features, model)
    (prediction, label, probabilities)
  }
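
One caveat about the scores returned above: each entry of classProbabilities is an independent sigmoid of a class margin, so the entries do not sum to 1, and index 0 is left at 0.0. If normalized probabilities are needed, a minimal sketch follows. It assumes that the multinomial model treats class 0 as the pivot class with a fixed margin of 0, which is how the MLlib documentation describes it. The object name PivotSoftmax and its method are hypothetical helpers, not part of Spark, and the code uses plain Scala so that it can be tried without a cluster.

```scala
// Hypothetical helper (not a Spark API): turns the K - 1 margins of a
// pivot-based multinomial model into K probabilities that sum to 1.
object PivotSoftmax {
  /** margins(i) is the margin of class i + 1; class 0 is assumed to be the pivot with margin 0. */
  def probabilities(margins: Array[Double]): Array[Double] = {
    // Prepend the pivot margin so that index k corresponds to class k.
    val all = 0.0 +: margins
    // Subtract the largest margin before exponentiating to avoid overflow.
    val maxMargin = all.max
    val exps = all.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }
}
```

Inside predictPoint, this would mean collecting the margins in an array instead of applying the sigmoid to each one, then calling PivotSoftmax.probabilities(margins) once after the loop.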

However:

It seems the Spark contributors are discouraging the use of MLlib in favor of ML. At the time of writing, the ML logistic regression API does not support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. You can obtain the raw scores by iterating through the underlying binary models:

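import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest}

val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
// Save each per-class binary model under a name built from its uid and class index.
ovrModel.models.zipWithIndex.foreach {
  case (model: LogisticRegressionModel, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}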

val model0 = LogisticRegressionModel.load("model-logreg_457c82141c06-0")
val model1 = LogisticRegressionModel.load("model-logreg_457c82141c06-1")
val model2 = LogisticRegressionModel.load("model-logreg_457c82141c06-2")

Now that you have the individual models, you can obtain the probabilities by calculating the sigmoid of the rawPrediction:

def sigmoid(x: Double): Double = {
  1.0 / (1.0 + Math.exp(-x))
}

val newPredictionAndLabels0 = model0.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels0.foreach(println)

val newPredictionAndLabels1 = model1.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels1.foreach(println)

val newPredictionAndLabels2 = model2.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels2.foreach(println)
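
The three per-model scores above are produced independently, so for a given row they do not generally sum to 1. If a single prediction and a set of comparable scores are wanted, one possible approach is sketched below: pick the class whose binary model gives the highest score and divide each score by their sum. This normalization is only a heuristic assumed here for illustration; it is not something OneVsRest itself provides, and OneVsRestScores is a hypothetical helper written in plain Scala.

```scala
// Hypothetical helper (not a Spark API): combines one-vs-rest sigmoid scores.
object OneVsRestScores {
  /** scores(i) is the sigmoid output of the binary model trained for class i. */
  def combine(scores: Array[Double]): (Int, Array[Double]) = {
    require(scores.nonEmpty, "need at least one per-class score")
    // The predicted class is the one whose binary model is most confident.
    val bestClass = scores.indexOf(scores.max)
    val total = scores.sum
    // Heuristic normalization; fall back to uniform if every score is zero.
    val normalized =
      if (total > 0.0) scores.map(_ / total)
      else Array.fill(scores.length)(1.0 / scores.length)
    (bestClass, normalized)
  }
}
```

For one row, the input array would hold the values computed from model0, model1 and model2, in class order.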

Upvotes: 0
