Onder

Reputation: 21

Binary Classification Evaluator AUC Score in Pyspark

I have a dataset with 2 classes (churners and non-churners) in the ratio 1:4. I used the Random Forest algorithm via Spark MLlib. My model is terrible at predicting the churn class; it predicts essentially every example as a non-churner. I use BinaryClassificationEvaluator to evaluate my model in PySpark. The default metric for BinaryClassificationEvaluator is areaUnderROC.

My code

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()

# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="indexedFeatures",
                            numTrees=1000, impurity="entropy")

# Train the model on the training data.
rfModel = rf.fit(train_df)
rfModel.featureImportances

# Make predictions on test data using the Transformer.transform() method.
predictions = rfModel.transform(test_df)

# Evaluate the model (default metric: area under the ROC curve).
auc = evaluator.evaluate(predictions)
print('Test Area Under ROC', auc)

Test Area Under Roc 0.8672196520652589

and here is the confusion matrix.

[image: confusion matrix showing TP = 0]

Since TP=0, how can that score be possible? Could this value be wrong?
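For context on why this can happen: AUC is computed from the ranking of the predicted probabilities across all thresholds, while the confusion matrix above is taken at the single default 0.5 threshold. A minimal pure-Python sketch (no Spark, toy scores invented for illustration) shows that both can hold at once: if every positive example scores below 0.5 but still above the negatives, the 0.5-threshold confusion matrix has TP = 0 while the AUC is high.

```python
# Toy scores: every positive's score is below 0.5 (so thresholding at 0.5
# yields TP = 0), yet positives rank above all negatives, so AUC is perfect.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.25, 0.40, 0.45]

# Confusion-matrix counts at the default 0.5 threshold.
tp = sum(1 for s, y in zip(scores, labels) if s >= 0.5 and y == 1)
fn = sum(1 for s, y in zip(scores, labels) if s < 0.5 and y == 1)
print("TP:", tp, "FN:", fn)  # TP: 0 FN: 2

# AUC = probability that a random positive is scored above a random negative.
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print("AUC:", auc)  # AUC: 1.0
```

So a high AUC with TP = 0 is not necessarily a wrong value; it means the model separates the classes by score but the default threshold is badly placed for the 1:4 class imbalance.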

I have other models that work fine, but this score makes me wonder whether the others are wrong as well.

Upvotes: 2

Views: 4749

Answers (1)

Kunal Narang

Reputation: 29

Your data might be heavily biased towards one of the classes. I would recommend using precision or F-measure, since they are better metrics in such situations. Try something like this:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// BinaryClassificationMetrics expects an RDD of (score, label) pairs,
// so extract the positive-class probability and the label first.
val scoreAndLabels = predictions.select("probability", "label").rdd.map { row =>
  (row.getAs[org.apache.spark.ml.linalg.Vector](0)(1), row.getDouble(1))
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val f1Score = metrics.fMeasureByThreshold
f1Score.collect.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
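Since the question uses PySpark, here is a hedged pure-Python sketch of what fMeasureByThreshold computes (toy scores invented for illustration, no Spark needed), which makes it easy to see how F1 changes as the decision threshold moves:

```python
def f1_at_threshold(scores, labels, t):
    """F1 score when predicting positive for every score >= t."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy imbalanced data: positives score below 0.5 but above the negatives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.25, 0.40, 0.45]
for t in [0.1, 0.3, 0.5]:
    print(f"Threshold: {t}, F-score: {f1_at_threshold(scores, labels, t)}")
```

Scanning thresholds this way shows why the default 0.5 cut can give F1 = 0 even when a lower threshold separates the classes well.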

Upvotes: -1
