Reputation: 9078
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# pipeline and dataset are defined earlier (not shown)
evaluator = BinaryClassificationEvaluator()
grid = ParamGridBuilder().build()  # empty grid: no hyperparameter optimization
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
evaluator.evaluate(cvModel.transform(dataset))
Returns:
cvModel.avgMetrics = [1.602872634746238]
evaluator.evaluate(cvModel.transform(dataset)) = 0.7267754950388204
Questions:
1. Why is cvModel.avgMetrics greater than 1, when the evaluator's default metric (area under ROC) cannot exceed 1?
2. What does evaluator.evaluate(cvModel.transform(dataset)) measure here, given that the same dataset is used both for fit and evaluate?
Upvotes: 3
Views: 5714
Reputation: 2794
This is a bug that was fixed recently; however, the fix has not been released yet.
Based on what you provided, I used the following code to replicate the issue:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

dataset = sc.parallelize([
    Row(features=Vectors.dense([1., 0.]), label=1.),
    Row(features=Vectors.dense([1., 1.]), label=0.),
    Row(features=Vectors.dense([0., 0.]), label=1.),
]).toDF()

lr = LogisticRegression()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
# two parameter maps in the grid, so avgMetrics will contain two entries
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 10]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
evaluator.evaluate(cvModel.transform(dataset))
Out[23]: 1.0
cvModel.avgMetrics
Out[34]: [2.0, 2.0]
Simply put, avgMetrics was summed, not averaged, across folds.
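Until the fixed version is released, a minimal workaround sketch is to divide the summed values by the number of folds yourself (the corrected name below is just illustrative):
num_folds = cv.getNumFolds()  # defaults to 3
corrected = [m / num_folds for m in cvModel.avgMetrics]  # recover the per-fold averages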
EDIT:
Regarding the second question, the easiest way to check what is being evaluated is to supply a separate test dataset:
to_test = sc.parallelize([
    Row(features=Vectors.dense([1., 0.]), label=1.),
    Row(features=Vectors.dense([1., 1.]), label=0.),
    Row(features=Vectors.dense([0., 1.]), label=1.),
]).toDF()
evaluator.evaluate(cvModel.transform(to_test))
Out[2]: 0.5
This confirms that the call returns the metric computed on whatever dataset you pass to transform, in this case the test data.
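More generally, with a realistically sized dataset, the usual pattern is to hold out a test split before cross-validation and evaluate the chosen model on it. A minimal sketch (the 0.8/0.2 ratio and the seed are arbitrary choices, not from the original post):
train, test = dataset.randomSplit([0.8, 0.2], seed=42)
cvModel = cv.fit(train)  # cross-validation runs on the training split only
test_metric = evaluator.evaluate(cvModel.transform(test))  # performance on held-out data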
Upvotes: 5