Deb

Reputation: 121

Alternative way to get results faster without applying UDF

I have this UDF, which returns an alert severity based on certain conditions.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def alert(name, X_Request, X_Actual):
    if "Impact" in name:
        return "Highest"

    fs = ['FS_00', 'FS_01', 'FS_02', 'FS_03']
    if name in fs:
        if X_Actual < -3:
            return "High"
        elif -3 <= X_Actual <= -2:
            return "Medium"
        elif -3 < X_Actual <= -0.5:
            return "Low"
    return None

alert_type = udf(alert, StringType())


df.withColumn("alert_level", alert_type(df["name"],df["x_request"],df["x_actual"]))

Can this be done without applying a UDF, since UDFs slow down performance?

Upvotes: 1

Views: 66

Answers (1)

samkart

Reputation: 6654

A when().otherwise() chain like the following should work.

import pyspark.sql.functions as func
fs = ['FS_00','FS_01','FS_02','FS_03']

data_sdf. \
    withColumn('alert_level',
               func.when(func.upper(func.col('name')).like('%IMPACT%'), func.lit('Highest')).
               when(func.col('name').isin(fs), 
                    func.when(func.col('x_actual') < -3, func.lit('High')).
                    when(func.col('x_actual').between(-3, -2), func.lit('Medium')).
                    when(func.col('x_actual').between(-2, -0.5), func.lit('Low'))
                    )
               )

Note - .between() includes the bounds provided to it (it's a >= & <=). But in this case that is safe due to when()'s order of evaluation (anything equal to -2 is already matched by the "Medium" branch before the "Low" branch is checked). Any other value that does not match the conditions will result in a null, as no otherwise() clause was provided.
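The first-match-wins behavior of a chained when() can be sanity-checked with a plain-Python mirror of the same cascade (the alert_level helper below is hypothetical, written only to illustrate the boundary handling):

```python
# Plain-Python mirror of the when() cascade above: conditions are tried in
# order and the first match wins, just like chained when() in Spark.
def alert_level(name, x_actual, fs=('FS_00', 'FS_01', 'FS_02', 'FS_03')):
    if "IMPACT" in name.upper():
        return "Highest"
    if name in fs:
        if x_actual < -3:
            return "High"
        if -3 <= x_actual <= -2:    # between(-3, -2) is inclusive of both bounds
            return "Medium"
        if -2 <= x_actual <= -0.5:  # -2 was already caught by the branch above
            return "Low"
    return None                     # no otherwise() -> null

print(alert_level("FS_01", -3))   # Medium: bound included; High needs < -3
print(alert_level("FS_01", -2))   # Medium, not Low: first match wins
print(alert_level("FS_01", -1))   # Low
print(alert_level("FS_01", 0))    # None
```

Running the boundary values through confirms that the overlapping between() ranges never conflict in practice.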

Upvotes: 1
