Reputation: 1939
I'm able to calculate the Median Absolute Error with this function:
from pyspark.sql import Window
import pyspark.sql.functions as f

def compute_Median_Abs_Err(df, expected_col, actual_col):
    grp_window = Window.partitionBy('grp')  # defined but not used below
    magic_percentile = f.expr('percentile_approx(abserror, 0.5)')
    # per-row absolute error, then the approximate median per group
    med_abs_err = (df
        .withColumn("abserror", f.abs(f.col(actual_col) - f.col(expected_col)))
        .groupby('start_month', 'start_dt')
        .agg(magic_percentile.alias("med_abs_error")))
    return med_abs_err
Which can be calculated with this equation:
MEDIAN(abs(predictions - actuals))
I'd like to be able to calculate the Median Absolute Percent Error, calculated with this equation:
MEDIAN( abs(predictions - actuals) / actuals )
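To make the target concrete, here is a minimal plain-Python sketch of that equation (the numbers are made up purely for illustration):
import statistics

# hypothetical example values
predictions = [110.0, 95.0, 42.0]
actuals = [100.0, 100.0, 40.0]

# divide each absolute error by its own actual first, then take the median
abs_perc_errors = [abs(p - a) / a for p, a in zip(predictions, actuals)]
median_abs_perc_error = statistics.median(abs_perc_errors)  # 0.05 here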
I thought I had it right with this:
from pyspark.sql import Window
import pyspark.sql.functions as f

def compute_Median_Perc_Err(df, expected_col, actual_col):
    grp_window = Window.partitionBy('grp')  # defined but not used below
    magic_percentile = f.expr('percentile_approx(abserror, 0.5)')
    med_perc_err = (df
        .withColumn("abserror", f.abs(f.col(actual_col) - f.col(expected_col)))
        .groupby('start_month', 'start_dt')
        .agg(magic_percentile.alias("med_abs_error"),
             f.avg(f.col(actual_col)).alias("mean"))
        .withColumn("med_perc_error", f.col("med_abs_error") / f.col("mean")))
    return med_perc_err
But I realized that with this, I am not dividing by the actuals before taking the median. I should divide by the actuals first, then take the median of that column. How do I write this code snippet so that it divides by the actuals first, given that I still need to take .agg(f.avg(f.col("actuals"))) after the groupby to get an accurate mean?
Upvotes: 1
Views: 3111
Reputation: 6226
I think you were almost there. In the spirit of the function proposed here, the median absolute percent error can be computed as follows:
import pyspark.sql.functions as psf
from pyspark.sql import Window as psw

def compute_mape(df, expected_col, actual_col):
    grp_window = psw.partitionBy('grp')  # defined but not used below
    magic_percentile = psf.expr('percentile_approx(relerror, 0.5)')
    # divide each error by its own actual first, then take the median
    mape = (df
        .withColumn("abserror", psf.col(actual_col) - psf.col(expected_col))
        .withColumn("relerror", psf.abs(psf.col("abserror") / psf.col(actual_col)))
        .groupBy('start_month', 'start_dt')
        .agg(magic_percentile.alias("med_perc_error")))
    return mape

compute_mape(df, "col1", "col2")
(note that the acronym MAPE is used loosely here; it usually stands for mean absolute percentage error)
Note: I changed the pyspark.sql.functions alias from f.* to psf.* and added the psw alias for pyspark.sql.Window. Everywhere else, I stuck to your notation.
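As a quick sanity check, here is a usage sketch on a toy DataFrame (the SparkSession setup, column names, and values are invented for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# toy data using the grouping columns assumed by compute_mape
toy = spark.createDataFrame(
    [("2019-01", "2019-01-01", 110.0, 100.0),
     ("2019-01", "2019-01-01", 95.0, 100.0),
     ("2019-02", "2019-02-01", 42.0, 40.0)],
    ["start_month", "start_dt", "col1", "col2"])

# col1 plays the expected column, col2 the actual one
compute_mape(toy, "col1", "col2").show()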
Upvotes: 1