Reputation: 111
Is there a built-in log loss function in pyspark?
I have a PySpark dataframe with the columns probability, rawPrediction, and label,
and I want to use the mean log loss to evaluate these predictions.
Upvotes: 6
Views: 2650
Reputation: 7755
No such function exists directly, as far as I can tell. But given a PySpark dataframe df
with the columns named as in the question, you can calculate the average log loss explicitly. The per-row loss is -[y*log(p) + (1-y)*log(1-p)], where y is the label and p the predicted probability, and the final metric is its mean over all rows:
import pyspark.sql.functions as f

eps = 1e-7

df_w_logloss = (
    df
    # clip predictions to within [eps, 1-eps] to prevent infinities
    .withColumn('prob_eps', f.least(f.greatest(f.col('probability'), f.lit(eps)), f.lit(1 - eps)))
    # per-row log loss: -[y*log(p) + (1-y)*log(1-p)]
    .withColumn(
        'logloss',
        -f.col('label') * f.log(f.col('prob_eps'))
        - (1. - f.col('label')) * f.log(1. - f.col('prob_eps')),
    )
)

# collect the mean of the per-row losses as a Python float
logloss = df_w_logloss.agg(f.mean('logloss').alias('ll')).collect()[0]['ll']
I'm assuming here that label is numerical (i.e. 0 or 1), and that probability represents the predictions of the model. (Not sure what rawPrediction might mean.)
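If probability is actually the vector column produced by a Spark ML model's transform() rather than a scalar, convert it to a scalar first. A minimal sketch, assuming Spark 3.0+, where pyspark.ml.functions.vector_to_array is available:

from pyspark.ml.functions import vector_to_array

# keep only the probability of the positive class (index 1 of the vector),
# so the clipping and log-loss expressions above operate on a scalar
df = df.withColumn('probability', vector_to_array(f.col('probability')).getItem(1))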
Upvotes: 8
Reputation: 96
I am assuming that you are using Spark ML and that your dataframe is the output of a fitted model's transform(), so that probability is a vector column. In that case, you can use the following function to calculate the log loss.
from pyspark.sql.types import FloatType
from pyspark.sql import functions as F

def log_loss(df):
    # extract the predicted probability of the positive class
    # (index 1 of the probability vector)
    get_positive_prob = F.udf(lambda v: float(v[1]), FloatType())
    df = df.withColumn("preds", get_positive_prob(F.col("probability")))
    # per-row negative log-likelihood
    y1, y0 = F.col("label"), 1 - F.col("label")
    p1, p0 = F.col("preds"), 1 - F.col("preds")
    nll = -(y1*F.log(p1) + y0*F.log(p0))
    # aggregate to the mean log loss
    return df.agg(F.mean(nll)).collect()[0][0]
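For example, a hypothetical usage sketch (raw_df, x1, and x2 are illustrative names, not from the question):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# assemble feature columns into the single vector column Spark ML expects
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(raw_df)  # raw_df is assumed to have x1, x2 and a 0/1 label

# fit a model and score the data; transform() adds probability and rawPrediction
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(log_loss(model.transform(train)))

Note that this function does not clip the probabilities away from 0 and 1, so a row where the model is completely confident and wrong would make the mean infinite.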
Upvotes: 1