Marsellus Wallace

Reputation: 18601

How to calculate Binary Classification Metrics in Spark MLlib with the DataFrame API

I am using Spark MLlib with the DataFrame API, given the following sample code:

import org.apache.spark.ml.classification.DecisionTreeClassifier

val dtc = new DecisionTreeClassifier()
val testResults = dtc.fit(training).transform(test)

Can I calculate the model quality metrics over testResults using the DataFrame API?

If not, how do I correctly transform my testResults (containing "label", "features", "rawPrediction", "probability", "prediction") so that I can use BinaryClassificationMetrics (RDD API)?

NOTE: I am interested in the "byThreshold" metrics as well

Upvotes: 1

Views: 2693

Answers (1)

jamborta

Reputation: 5210

If you look at the constructor of BinaryClassificationMetrics, it takes an RDD[(Double, Double)] of (score, label) pairs. You can convert the DataFrame to the right format like this:

import org.apache.spark.ml.linalg.Vector

// (score, label) pairs: the probability of the positive class and the true label
val scoreAndLabels = testResults.select("label", "probability")
  .rdd
  .map(row => (row.getAs[Vector]("probability")(1), row.getAs[Double]("label")))
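
From there you can construct the metrics object directly. Here is a minimal sketch (assuming the scoreAndLabels RDD built above) that prints both the summary metrics and the "byThreshold" metrics you asked about:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// Single-number summaries
println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR  = ${metrics.areaUnderPR()}")

// Per-threshold metrics come back as RDD[(threshold, value)] pairs
metrics.precisionByThreshold().collect().foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}
metrics.recallByThreshold().collect().foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}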

EDIT:

Probability is stored in a Vector whose length equals the number of classes you'd like to predict. For binary classification the first element corresponds to label = 0 and the second to label = 1, so you should pick the element for your positive label (normally label = 1).
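
For example, with a toy probability vector (hypothetical values, not output from the model above):

import org.apache.spark.ml.linalg.Vectors

// index 0 -> P(label = 0), index 1 -> P(label = 1)
val probability = Vectors.dense(0.3, 0.7)
val positiveScore = probability(1)  // 0.7, the score passed to BinaryClassificationMetrics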

Upvotes: 3
