C. Perry

Reputation: 53

PySpark Dynamic When Statement

I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string, the column name changes. An example of the code I currently have is below:

df = df.withColumn("newCol", \
    F.when(df.pet == "dog", df.dog_Column) \
    .otherwise(F.when(df.pet == "cat", df.cat_Column) \
    .otherwise(None)))

I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:

dfvalues = ["dog", "cat", "parrot", "goldfish"]

df = df.withColumn("newCol", F.when(df.pet == dfvalues[0], \
     F.col(dfvalues[0] + "_Column")))

The issue is that I cannot figure out how to make a looping condition in Pyspark.

Upvotes: 5

Views: 4074

Answers (2)

hamza tuna

Reputation: 1497

I faced the same problem. You can use Python's reduce to do the looping for a clean solution:

from functools import reduce
from pyspark.sql import functions as F

def update_col(df1, val):
    # Overwrite newCol on rows where pet matches val; keep the existing value otherwise.
    return df1.withColumn('newCol',
               F.when(F.col('pet') == val, F.col(val + '_column'))
                .otherwise(F.col('newCol')))

# add empty column
df1 = df.withColumn('newCol', F.lit(0))

reduce(update_col, dfvalues, df1).show()

With dfvalues = ["dog", "cat"], that yields:

+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
|      cat1|      dog1|dog|  dog1|
|      cat2|      dog2|cat|  cat2|
+----------+----------+---+------+
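To see how reduce threads the DataFrame through update_col above, here is a minimal pure-Python sketch of the same fold, using a list of dicts in place of a Spark DataFrame (the row data mirrors the table above; no Spark required):

```python
from functools import reduce

# Toy "DataFrame": one dict per row, mirroring the columns in the answer.
rows = [
    {"cat_column": "cat1", "dog_column": "dog1", "pet": "dog"},
    {"cat_column": "cat2", "dog_column": "dog2", "pet": "cat"},
]

def update_col(rows, val):
    # Analogue of withColumn + when/otherwise: overwrite newCol only on rows
    # whose pet matches val; other rows keep whatever newCol they already have.
    return [
        {**r, "newCol": r[val + "_column"] if r["pet"] == val else r.get("newCol")}
        for r in rows
    ]

dfvalues = ["dog", "cat"]
result = reduce(update_col, dfvalues, rows)
for r in result:
    print(r["pet"], r["newCol"])
```

Each call of update_col returns a new list of rows, which reduce feeds into the next call, exactly as each withColumn call returns a new DataFrame.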

Upvotes: 1

pault

Reputation: 43544

One way could be to use a list comprehension in conjunction with coalesce, very similar to the answer here.

mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))

This works because when() returns null for non-matching rows if there is no otherwise(), and coalesce() will pick the first non-null column.
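The coalesce idea can also be sketched without Spark: build one candidate value per pet type (None when the condition fails) and take the first non-None value per row. This is only an illustration of the pattern; the row dicts and the helper name first_non_null are made up for the example:

```python
dfvalues = ["dog", "cat", "parrot", "goldfish"]

rows = [
    {"cat_Column": "cat1", "dog_Column": "dog1", "pet": "dog"},
    {"cat_Column": "cat2", "dog_Column": "dog2", "pet": "cat"},
]

def first_non_null(values):
    # Analogue of F.coalesce(): return the first value that is not None.
    return next((v for v in values if v is not None), None)

for row in rows:
    # One candidate per pet type, mimicking the F.when(...) list comprehension:
    # None unless the condition matches (a when() with no otherwise()).
    candidates = [row.get(p + "_Column") if row["pet"] == p else None
                  for p in dfvalues]
    row["newCol"] = first_non_null(candidates)
```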

Upvotes: 8
