Reputation: 21
I have a dataset with two classes (churners and non-churners) in a 1:4 ratio. I trained a Random Forest model with Spark MLlib, but it is terrible at predicting the churn class and effectively never predicts churn. I use BinaryClassificationEvaluator in PySpark to evaluate the model; its default metric is areaUnderROC.
My code:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="indexedFeatures", numTrees=1000, impurity="entropy")
# Train model with Training Data
rfModel = rf.fit(train_df)
rfModel.featureImportances
# Make predictions on test data using the Transformer.transform() method.
predictions = rfModel.transform(test_df)
# Evaluate with areaUnderROC (the evaluator's default metric)
auc = evaluator.evaluate(predictions)
print('Test Area Under ROC', auc)
Test Area Under Roc 0.8672196520652589
Here is the confusion matrix. Since TP = 0, how is that score possible? Could this value be wrong? I have other models that work fine, but this score makes me wonder whether the others are wrong as well.
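To illustrate what I find confusing: areaUnderROC is threshold-independent, it measures how well the model *ranks* positives above negatives across all thresholds, while my confusion matrix uses the default 0.5 cut-off. A minimal pure-Python sketch (with made-up scores, not from my dataset) shows both can happen at once, TP = 0 at the 0.5 cut-off yet a perfect AUC:

```python
# Hypothetical churn scores: every positive scores below 0.5
# (so thresholding at 0.5 predicts no churners at all),
# but every positive still outranks every negative.
pos_scores = [0.40, 0.35, 0.30]        # churners (label 1)
neg_scores = [0.20, 0.15, 0.10, 0.05]  # non-churners (label 0)

# Default threshold 0.5: no score reaches it, so TP = 0.
tp = sum(s >= 0.5 for s in pos_scores)

# AUC = probability that a random positive outranks a random negative
# (Mann-Whitney formulation; ties count as half).
pairs = [(p, n) for p in pos_scores for n in neg_scores]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(tp)   # 0
print(auc)  # 1.0
```

So a high AUC alongside TP = 0 is not necessarily a bug; it can simply mean the default threshold is wrong for the 1:4 class imbalance.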
Upvotes: 2
Views: 4749
Reputation: 29
Your data might be heavily biased towards one of the classes; I would recommend using precision or the F-measure, since they are better metrics in such situations. Try this:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// BinaryClassificationMetrics expects an RDD of (score, label) pairs,
// so extract the positive-class probability from the predictions DataFrame
val scoreAndLabels = predictions.select("probability", "label").rdd
  .map(row => (row.getAs[org.apache.spark.ml.linalg.Vector](0)(1), row.getDouble(1)))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val f1Score = metrics.fMeasureByThreshold
f1Score.collect.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}
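Since the question uses PySpark, note that `fMeasureByThreshold` is only exposed in the Scala API; in Python you would typically use `MulticlassClassificationEvaluator(metricName="f1")` instead. What `fMeasureByThreshold` computes can be sketched in plain Python (with hypothetical scores and labels, not the question's data):

```python
# Sketch of fMeasureByThreshold: the F1 score at each candidate threshold.
# Each pair is (predicted score for the positive class, true label).
scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 1), (0.3, 0), (0.1, 0)]

def f1_at(threshold, scored):
    """F1 when everything scoring >= threshold is predicted positive."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep every distinct score as a threshold, as the Spark method does.
for t in sorted({s for s, _ in scored}, reverse=True):
    print(f"Threshold: {t}, F-score: {f1_at(t, scored):.3f}")
```

Picking the threshold that maximises F1 (rather than the default 0.5) is often the practical fix for imbalanced classes like the 1:4 split here.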
Upvotes: -1