sanjayr

Reputation: 1939

Pyspark - Calculate Median Absolute Percent Error with groupby

I'm able to calculate the Median Absolute Error with this function:

from pyspark.sql import Window
import pyspark.sql.functions as f

def compute_Median_Abs_Err(df, expected_col, actual_col):

    grp_window = Window.partitionBy('grp')  # defined but not used below

    magic_percentile = f.expr('percentile_approx(abserror, 0.5)')

    med_abs_err = (df.withColumn("abserror",
                                 f.abs(f.col(actual_col) - f.col(expected_col)))
                     .groupby('start_month', 'start_dt')
                     .agg(magic_percentile.alias("med_abs_error")))

    return med_abs_err

This can be calculated with the equation:

MEDIAN(abs(predictions - actuals))

I'd like to be able to calculate the Median Absolute Percent Error, calculated with this equation:

MEDIAN( abs(predictions - actuals) / actuals )

I thought I had it right with this:

from pyspark.sql import Window
import pyspark.sql.functions as f

def compute_Median_Perc_Err(df, expected_col, actual_col):

    grp_window = Window.partitionBy('grp')  # defined but not used below

    magic_percentile = f.expr('percentile_approx(abserror, 0.5)')

    med_perc_err = (df.withColumn("abserror",
                                  f.abs(f.col(actual_col) - f.col(expected_col)))
                      .groupby('start_month', 'start_dt')
                      .agg(magic_percentile.alias("med_abs_error"),
                           f.avg(f.col(actual_col)).alias("mean"))
                      .withColumn("med_perc_error",
                                  f.col("med_abs_error") / f.col("mean")))

    return med_perc_err

But I realized that with this I am not dividing by the actuals before taking the median; I am dividing the median absolute error by the mean of the actuals. I should divide each error by its actual first, then take the median of that column.
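A quick plain-Python sketch (with made-up numbers) shows the two quantities genuinely differ:

from statistics import median

predictions = [110, 95, 130]   # made-up values for illustration
actuals     = [100, 50, 200]

abs_errors = [abs(p - a) for p, a in zip(predictions, actuals)]

# What my function computes: median abs error divided by the mean of actuals
median(abs_errors) / (sum(actuals) / len(actuals))   # 45 / 116.67 ≈ 0.386

# What I actually want: divide each abs error by its own actual, then take the median
median(e / a for e, a in zip(abs_errors, actuals))   # median(0.10, 0.90, 0.35) = 0.35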

How do I write this code snippet to divide by the actuals first, since I still need to take .agg(f.avg(f.col("actuals"))) after the groupby to get an accurate mean?

Upvotes: 1

Views: 3111

Answers (1)

linog

Reputation: 6226

I think you were almost there. In the spirit of the function proposed here, the median absolute percent error can be computed as follows:

import pyspark.sql.functions as psf
from pyspark.sql import Window as psw

def compute_mape(df, expected_col, actual_col):

    grp_window = psw.partitionBy('grp')  # defined but not used below

    # median of the per-row relative errors
    magic_percentile = psf.expr('percentile_approx(relerror, 0.5)')

    mape = (df.withColumn("error",  # signed error; abs is applied in relerror
                          psf.col(actual_col) - psf.col(expected_col))
              .withColumn("relerror",
                          psf.abs(psf.col("error") / psf.col(actual_col)))
              .groupBy('start_month', 'start_dt')
              .agg(magic_percentile.alias("med_perc_error")))

    return mape


compute_mape(df, "col1", "col2")

(Here the acronym MAPE does not stand for mean absolute percentage error, as it usually does.)

Note: I changed the pyspark.sql.functions alias from f.* to psf.* and added the psw alias for pyspark.sql.Window. Everywhere else, I stuck to your notation.
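For completeness, a minimal self-contained run might look like this (the SparkSession setup and the toy column values are illustrative assumptions, not part of the answer above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with the grouping columns the function expects (made-up numbers)
df = spark.createDataFrame(
    [("2020-01", "2020-01-01", 110.0, 100.0),
     ("2020-01", "2020-01-01", 95.0, 50.0),
     ("2020-02", "2020-02-01", 130.0, 200.0)],
    ["start_month", "start_dt", "expected", "actual"],
)

compute_mape(df, "expected", "actual").show()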

Upvotes: 1
