Reputation: 9078
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# pipeline and dataset are defined earlier (not shown)
evaluator = BinaryClassificationEvaluator()
grid = ParamGridBuilder().build()  # empty grid: no hyperparameter optimization
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
evaluator.evaluate(cvModel.transform(dataset))
Returns:
cvModel.avgMetrics = [1.602872634746238]
evaluator.evaluate(cvModel.transform(dataset)) = 0.7267754950388204
Questions:
1. Why is cvModel.avgMetrics greater than 1, when the evaluator's default metric (area under ROC) cannot exceed 1?
2. What does evaluator.evaluate(cvModel.transform(dataset)) measure here, given that the same dataset is used both for fit and evaluate?
Upvotes: 3
Views: 5714
Reputation: 2794
This is a bug that was fixed recently; however, the fix has not been released yet.
Based on what you provided, I used the following code to replicate the issue:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

dataset = sc.parallelize([
    Row(features=Vectors.dense([1., 0.]), label=1.),
    Row(features=Vectors.dense([1., 1.]), label=0.),
    Row(features=Vectors.dense([0., 0.]), label=1.),
]).toDF()

lr = LogisticRegression()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
# two parameter maps in the grid, so avgMetrics will contain two entries
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 10]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
evaluator.evaluate(cvModel.transform(dataset))
Out[23]: 1.0
cvModel.avgMetrics
Out[34]: [2.0, 2.0]
Simply put, avgMetrics was summed, not averaged, across folds.
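Until the fixed version is released, a minimal workaround sketch is to divide the summed values by the number of folds yourself (the corrected name below is just illustrative):
num_folds = cv.getNumFolds()  # defaults to 3
corrected = [m / num_folds for m in cvModel.avgMetrics]  # recover the per-fold averages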
EDIT:
Regarding the second question, the easiest way to check what is being evaluated is to supply a separate test dataset:
to_test = sc.parallelize([
    Row(features=Vectors.dense([1., 0.]), label=1.),
    Row(features=Vectors.dense([1., 1.]), label=0.),
    Row(features=Vectors.dense([0., 1.]), label=1.),
]).toDF()
evaluator.evaluate(cvModel.transform(to_test))
Out[2]: 0.5
This confirms that the call returns the metric computed on whatever dataset you pass to transform, in this case the test data.
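More generally, with a realistically sized dataset, the usual pattern is to hold out a test split before cross-validation and evaluate the chosen model on it. A minimal sketch (the 0.8/0.2 ratio and the seed are arbitrary choices, not from the original post):
train, test = dataset.randomSplit([0.8, 0.2], seed=42)
cvModel = cv.fit(train)  # cross-validation runs on the training split only
test_metric = evaluator.evaluate(cvModel.transform(test))  # performance on held-out data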
Upvotes: 5