Reputation: 285
It does not appear that the PySpark One-vs-Rest classifier provides probabilities. Is there a way to get them?
My code is below. I have included the standard multiclass classifier for comparison.
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# load data file.
inputData = spark.read.format("libsvm") \
    .load("/data/mllib/sample_multiclass_classification_data.txt")
(train, test) = inputData.randomSplit([0.8, 0.2])
# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)
# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)
# train the multiclass model.
ovrModel = ovr.fit(train)
lrm = lr.fit(train)
# score the model on test data.
predictions = ovrModel.transform(test)
predictions2 = lrm.transform(test)
predictions.show(6)
predictions2.show(6)
Upvotes: 0
Views: 1136
Reputation: 1737
I don't think you can access the probability (confidence) vector: the model uses the index of the highest confidence as the prediction and then drops the column holding the confidence vector. To test this, you can make a copy of the class and remove the .drop(accColName) call at the end of the transform code:
http://spark.apache.org/docs/2.0.1/api/python/_modules/pyspark/ml/classification.html
# output the index of the classifier with highest confidence as prediction
labelUDF = udf(
    lambda predictions: float(max(enumerate(predictions), key=operator.itemgetter(1))[0]),
    DoubleType())
# output label and label metadata as prediction
return aggregatedDataset.withColumn(
    self.getPredictionCol(), labelUDF(aggregatedDataset[accColName])).drop(accColName)
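To make the argmax step concrete, here is a minimal plain-Python sketch of what the lambda inside labelUDF computes for a single row, with the confidence list returned alongside the label instead of being discarded. The function name predict_with_confidences and the sample numbers are hypothetical and only for illustration; this is not Spark code and does not touch a DataFrame.

```python
import operator


def predict_with_confidences(confidences):
    """Return (label, confidences) for one row.

    The label is computed the same way as the lambda in the quoted labelUDF:
    the index of the largest confidence value, cast to float.
    """
    label = float(max(enumerate(confidences), key=operator.itemgetter(1))[0])
    # Keep a copy of the per-class confidences instead of dropping them.
    return label, list(confidences)


# Hypothetical confidences from three binary classifiers (classes 0, 1, 2).
row_confidences = [0.12, 0.81, 0.30]
label, kept = predict_with_confidences(row_confidences)
print(label, kept)
```

Note that these values come from independent binary classifiers, so they need not sum to 1 the way the probability column of a single multinomial model does.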
Upvotes: 1