Reputation: 1584
How to update a column in a PySpark dataframe with a where clause?
This is similar to this SQL operation:
UPDATE table1 SET alpha1 = x WHERE alpha2 < 6;
where alpha1 and alpha2 are columns of table1.
For example, I have a dataframe table1 with the values below:
table1:
alpha1  alpha2
3       7
4       5
5       4
6       8

table1 after the update:
alpha1  alpha2
3       7
x       5
x       4
6       8
How can I do this with a PySpark dataframe?
Upvotes: 2
Views: 4482
Reputation: 13001
You are looking for the when function:
import pyspark.sql.functions

df = spark.createDataFrame([("3", 7), ("4", 5), ("5", 4), ("6", 8)], ["alpha1", "alpha2"])
df.show()
>>> +------+------+
>>> |alpha1|alpha2|
>>> +------+------+
>>> |     3|     7|
>>> |     4|     5|
>>> |     5|     4|
>>> |     6|     8|
>>> +------+------+
df2 = df.withColumn("alpha1", pyspark.sql.functions.when(df["alpha2"] < 6, "x").otherwise(df["alpha1"]))
df2.show()
>>> +------+------+
>>> |alpha1|alpha2|
>>> +------+------+
>>> |     3|     7|
>>> |     x|     5|
>>> |     x|     4|
>>> |     6|     8|
>>> +------+------+
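If you prefer to stay closer to the CASE WHEN wording of the original SQL, the same update can be written with expr; this is a minimal sketch assuming the df defined above and an active SparkSession:

from pyspark.sql import functions as F

# Same transformation expressed as a SQL CASE WHEN string.
df3 = df.withColumn("alpha1", F.expr("CASE WHEN alpha2 < 6 THEN 'x' ELSE alpha1 END"))
df3.show()

df3.show() prints the same table as df2 above; which form you use is mostly a matter of taste, though when/otherwise composes more naturally with conditions built on the Python side.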
Upvotes: 7