Reputation: 65
I have a dataframe that creates a new column based on a reduction calculation of existing columns. I need a check so that if the reduction value used is higher than a particular threshold, it is capped at that threshold (i.e. it should never exceed it).
I've tried wrapping a when statement both inside and after the .withColumn call:
df = df.withColumn('total_new_load',
                   col('existing_load') * (5 - col('tot_reduced_load')))
Basically I need to add an if-statement of some sort in a pyspark syntax relating to my dataframe code, such as:
if tot_reduced_load > 50:
    tot_reduced_load = 50
Upvotes: 1
Views: 1453
Reputation: 4099
Try this:
Sample Data:
df = spark.createDataFrame([(1,30),(2,40),(3,60)],['row_id','tot_reduced_load'])
df.show()
#+------+----------------+
#|row_id|tot_reduced_load|
#+------+----------------+
#|     1|              30|
#|     2|              40|
#|     3|              60|
#+------+----------------+
withColumn
from pyspark.sql import functions as psf
tot_reduced_load_new = psf.when(psf.col("tot_reduced_load") > 50, 50).otherwise(psf.col("tot_reduced_load"))
df.withColumn("tot_reduced_load_new", tot_reduced_load_new).show()
#+------+----------------+--------------------+
#|row_id|tot_reduced_load|tot_reduced_load_new|
#+------+----------------+--------------------+
#|     1|              30|                  30|
#|     2|              40|                  40|
#|     3|              60|                  50|
#+------+----------------+--------------------+
selectExpr
df.selectExpr("*","CASE WHEN tot_reduced_load > 50 THEN 50 ELSE tot_reduced_load END AS tot_reduced_load_new").show()
#+------+----------------+--------------------+
#|row_id|tot_reduced_load|tot_reduced_load_new|
#+------+----------------+--------------------+
#|     1|              30|                  30|
#|     2|              40|                  40|
#|     3|              60|                  50|
#+------+----------------+--------------------+
Upvotes: 0
Reputation: 2231
Try this (note that .otherwise must be chained on the when column expression, not on the DataFrame):
from pyspark.sql import functions as F
df.withColumn("tot_reduced_load", F.when(F.col("tot_reduced_load") > 50, 50).otherwise(F.col("tot_reduced_load")))
Upvotes: 2