sjishan
sjishan

Reputation: 3672

PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I would only do it if another column (Survived) has the value 0 for the corresponding row where it is blank for Age. If it is 1 in the Survived column but blank in Age column then I will keep it as null.

I tried to use && operator but it didn't work. Here is my code:

tdata.withColumn("Age",  when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()

Any suggestions how to handle that? Thanks.

Error Message:

SyntaxError: invalid syntax
  File "<ipython-input-33-3e691784411c>", line 1
    tdata.withColumn("Age",  when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
                                                    ^

Upvotes: 74

Views: 315982

Answers (5)

user238607
user238607

Reputation: 2468

Also make sure the order of the conditions also matter.

RIGHT : Restrictive condition is after the relaxed codition.

ultimate_optimized_join = spark_filteredfinal_df1.crossJoin(spark_filteredfinal_df2) \
                        .where( (F.col("df1_sorted_row_num") < F.col("df2_sorted_row_num"))                                                          
                                &
                                (F.col("df1_a_x_dup") == F.col("df2_a_x_dup"))
                                ) \
                        .select(["df1_a_x_dup", "df2_a_x_dup", "df1_sorted_row_num", "df1_sorted_row_num",  "df1_collectedIds", "df2_collectedIds"])

WRONG : Restrictive condition in where clause is before relaxed condition.

ultimate_optimized_join = spark_filteredfinal_df1.crossJoin(spark_filteredfinal_df2) \
                        .where( (F.col("df1_a_x_dup") == F.col("df2_a_x_dup"))
                                &
                                (F.col("df1_sorted_row_num") < F.col("df2_sorted_row_num"))
                                ) \
                        .select(["df1_a_x_dup", "df2_a_x_dup", "df1_sorted_row_num", "df1_sorted_row_num",  "df1_collectedIds", "df2_collectedIds"])

Upvotes: 0

vj sreenivasan
vj sreenivasan

Reputation: 1343

when in pyspark multiple conditions can be built using &(for and) and | (for or).

Note:In pyspark t is important to enclose every expressions within parenthesis () that combine to form the condition

%pyspark
dataDF = spark.createDataFrame([(66, "a", "4"), 
                                (67, "a", "0"), 
                                (70, "b", "4"), 
                                (71, "d", "4")],
                                ("id", "code", "amt"))
dataDF.withColumn("new_column",
       when((col("code") == "a") | (col("code") == "d"), "A")
      .when((col("code") == "b") & (col("amt") == "4"), "B")
      .otherwise("A1")).show()

In Spark Scala code (&&) or (||) conditions can be used within when function

//scala
val dataDF = Seq(
      (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4"
      )).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
       when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")
      .otherwise("A1")).show()

=======================

Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+

This code snippet is copied from sparkbyexamples.com

Upvotes: 37

Jose Alberto Gonzalez
Jose Alberto Gonzalez

Reputation: 81

it should works at least in pyspark 2.4

tdata = tdata.withColumn("Age",  when((tdata.Age == "") & (tdata.Survived == "0") , "NewValue").otherwise(tdata.Age))

Upvotes: 4

mahatmawx
mahatmawx

Reputation: 9

It should be:

$when(((tdata.Age == "" ) & (tdata.Survived == "0")), mean_age_0)

Upvotes: -1

zero323
zero323

Reputation: 330283

You get SyntaxError error exception because Python has no && operator. It has and and & where the latter one is the correct choice to create boolean expressions on Column (| for a logical disjunction and ~ for logical negation).

Condition you created is also invalid because it doesn't consider operator precedence. & in Python has a higher precedence than == so expression has to be parenthesized.

(col("Age") == "") & (col("Survived") == "0")
## Column<b'((Age = ) AND (Survived = 0))'>

On a side note when function is equivalent to case expression not WHEN clause. Still the same rules apply. Conjunction:

df.where((col("foo") > 0) & (col("bar") < 0))

Disjunction:

df.where((col("foo") > 0) | (col("bar") < 0))

You can of course define conditions separately to avoid brackets:

cond1 = col("Age") == "" 
cond2 = col("Survived") == "0"

cond1 & cond2

Upvotes: 152

Related Questions