Billie AK

Reputation: 65

How to do a check/try-catch on a PySpark DataFrame?

I have a DataFrame that creates a new column based on a reduction calculation over existing columns. I need to check whether the reduction value used exceeds a particular threshold; if it does, it should be set equal to the threshold rather than exceed it.

I've tried wrapping a when() expression both inside and after the .withColumn() call:

from pyspark.sql.functions import col

df = df.withColumn('total_new_load',
                   col('existing_load') * (5 - col('tot_reduced_load')))

Basically, I need to add some sort of if-statement in PySpark syntax to my DataFrame code, such as:

  if tot_reduced_load > 50 
  then 
  tot_reduced_load = 50

Upvotes: 1

Views: 1453

Answers (2)

Shantanu Sharma

Reputation: 4099

Try this -

Sample Data:

df = spark.createDataFrame([(1,30),(2,40),(3,60)],['row_id','tot_reduced_load'])

df.show()

#+------+----------------+
#|row_id|tot_reduced_load|
#+------+----------------+
#|     1|              30|
#|     2|              40|
#|     3|              60|
#+------+----------------+

Option 1: withColumn

from pyspark.sql import functions as psf

tot_reduced_load_new = psf.when(psf.col("tot_reduced_load") > 50, 50).otherwise(psf.col("tot_reduced_load"))

df.withColumn("tot_reduced_load_new", tot_reduced_load_new).show()

#+------+----------------+--------------------+
#|row_id|tot_reduced_load|tot_reduced_load_new|
#+------+----------------+--------------------+
#|     1|              30|                  30|
#|     2|              40|                  40|
#|     3|              60|                  50|
#+------+----------------+--------------------+

Option 2: selectExpr

df.selectExpr("*","CASE WHEN tot_reduced_load > 50 THEN 50 ELSE tot_reduced_load END AS tot_reduced_load_new").show()

#+------+----------------+--------------------+
#|row_id|tot_reduced_load|tot_reduced_load_new|
#+------+----------------+--------------------+
#|     1|              30|                  30|
#|     2|              40|                  40|
#|     3|              60|                  50|
#+------+----------------+--------------------+
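
For completeness, the same cap can also be written with pyspark.sql.functions.least, which returns the row-wise minimum of its arguments. A minimal sketch against the same sample data:

# least() keeps the smaller of the column value and the literal 50,
# so anything above the threshold is capped
df.withColumn("tot_reduced_load_new",
              psf.least(psf.col("tot_reduced_load"), psf.lit(50))).show()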

Upvotes: 0

Manrique

Reputation: 2231

Try this

from pyspark.sql import functions as F
df = df.withColumn("tot_reduced_load",
                   F.when(F.col("tot_reduced_load") > 50, 50)
                    .otherwise(F.col("tot_reduced_load")))
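
Note that otherwise() has to chain on the when() column, not on the DataFrame, and the result must be assigned back. Tying this into the question's original calculation, a minimal sketch assuming the asker's existing_load and tot_reduced_load column names:

from pyspark.sql import functions as F

# Cap the reduction at 50 first...
df = df.withColumn("tot_reduced_load",
                   F.when(F.col("tot_reduced_load") > 50, 50)
                    .otherwise(F.col("tot_reduced_load")))

# ...then reuse the capped value in the load calculation
df = df.withColumn("total_new_load",
                   F.col("existing_load") * (5 - F.col("tot_reduced_load")))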

Upvotes: 2
