Reputation: 1948
I am new to both Spark and Python. I used Spark to train a logistic regression model with just two classes (0 and 1), fitting it on my train data frame.
This is how my pipeline model was defined:
# Model definition:
lr = LogisticRegression(featuresCol = "lr_features", labelCol = "targetvar")
# Pipeline definition:
lr_pipeline = Pipeline(stages = indexStages + encodeStages + [lr_assembler, lr])
# Fit the logistic regression model:
lrModel = lr_pipeline.fit(train)
Then I ran predictions using my test dataframe:
lr_predictions = lrModel.transform(test)
Now, my lr_predictions has a column 'probability' that looks like a nested list to me. For example, its first cell contains:
[1,2,[],[0.88,0.11]]
I assume this means: the probability of the first class (label 0) is 0.88, and the probability of the second class (label 1) is 0.11.
By default (threshold = 0.5) this observation is predicted as 0. However, I found a value (bestThreshold) that maximizes the F-measure (in my case it's 0.21):
fMeasure = lr_summary.fMeasureByThreshold
bestThreshold = fMeasure.orderBy(fMeasure['F-Measure'].desc()).first().threshold
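(For reference, this is how I obtained lr_summary; the logistic regression stage is the last stage of my fitted pipeline:)
# The last stage of the fitted pipeline is the LogisticRegressionModel;
# its training summary exposes fMeasureByThreshold.
lr_summary = lrModel.stages[-1].summary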
I would like to apply bestThreshold to the 'probability' column and get a new column ('pred_new', for example) that contains the class assignments (0 or 1) based on bestThreshold as opposed to 0.5.
I cannot use the code below, because the 'probability' column is a vector rather than a plain numeric column:
from pyspark.sql.functions import when
lr_predictions = lr_predictions.withColumn(
    "prob_best",
    when(lr_predictions["probability"] >= bestThreshold, 1).otherwise(0))
I feel I need to map the 'probability' column to a new column based on the new threshold. But I am not sure how to do that, given this (for me) complex structure of the 'probability' column.
Thank you so much for your advice!
Upvotes: 3
Views: 1485
Reputation: 35229
If lrModel is a LogisticRegressionModel:
type(lrModel)
## pyspark.ml.classification.LogisticRegressionModel
you can use the internal Java object to set the threshold:
lrModel._java_obj.setThreshold(bestThreshold)
and transform:
lrModel.transform(data)
You can do the same to modify rawPredictionCol, predictionCol, and probabilityCol.
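For example (a sketch; 'pred_new' is just the column name from your question):
# Route the thresholded predictions to a column named 'pred_new'
# instead of the default 'prediction'.
lrModel._java_obj.setPredictionCol("pred_new")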
This should become part of the public API in a future release (Spark 2.3):
lrModel.transform(data, {lrModel.threshold: bestThreshold})
You can also use a UDF:
from pyspark.sql.functions import udf, lit

@udf("integer")
def predict(v, threshold):
    # v is the probability vector; v[1] is the probability of class 1.
    # Predict 1 when it exceeds the threshold (matching setThreshold semantics).
    return 1 if v[1] > threshold else 0

lr_predictions.withColumn(
    "prob_best",
    predict(lr_predictions["probability"], lit(bestThreshold)))
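If you are on Spark 3.0 or later (newer than this answer originally assumed), you can skip the UDF with vector_to_array; a sketch:
from pyspark.sql.functions import col, when
from pyspark.ml.functions import vector_to_array

# Convert the probability vector to an array, take the class-1 entry,
# and threshold it directly.
lr_predictions.withColumn(
    "prob_best",
    when(vector_to_array(col("probability"))[1] > bestThreshold, 1).otherwise(0))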
Edit: With PipelineModel you can try to access the LogisticRegressionModel (as in your previous question) and do the same thing.
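For example (a sketch, assuming the logistic regression stage is the last stage of the fitted pipeline):
# Grab the fitted LogisticRegressionModel from the pipeline, set the
# threshold on its Java backend, and re-run predictions; the pipeline
# holds a reference to the same stage object, so transform picks it up.
lr_stage = lrModel.stages[-1]
lr_stage._java_obj.setThreshold(bestThreshold)
lr_predictions_new = lrModel.transform(test)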
Upvotes: 4