C. Perry

Reputation: 53

PySpark Dynamic When Statement

I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string, the column name changes. An example of the code I currently have is below:

df = df.withColumn("newCol", \
    F.when(df.pet == "dog", df.dog_Column) \
    .otherwise(F.when(df.pet == "cat", df.cat_Column) \
    .otherwise(None)))

I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:

dfvalues = ["dog", "cat", "parrot", "goldfish"]

df = df.withColumn("newCol", F.when(df.pet == dfvalues[0], \
     F.col(dfvalues[0] + "_Column")))

The issue is that I cannot figure out how to make a looping condition in Pyspark.

Upvotes: 5

Views: 4074

Answers (2)

hamza tuna

Reputation: 1497

I faced the same problem. You can use Python's reduce to do the looping for a clean solution:

from functools import reduce
from pyspark.sql import functions as F

def update_col(df1, val):
    # Overwrite newCol on rows where pet matches val; keep the existing value otherwise.
    return df1.withColumn('newCol',
               F.when(F.col('pet') == val, F.col(val + '_column'))
                .otherwise(F.col('newCol')))

# add empty column
df1 = df.withColumn('newCol', F.lit(0))

reduce(update_col, dfvalues, df1).show()

With dfvalues = ["dog", "cat"], that yields:

+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
|      cat1|      dog1|dog|  dog1|
|      cat2|      dog2|cat|  cat2|
+----------+----------+---+------+
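To see how reduce threads the DataFrame through update_col above, here is a minimal pure-Python sketch of the same fold, using a list of dicts in place of a Spark DataFrame (the row data mirrors the table above; no Spark required):

```python
from functools import reduce

# Toy "DataFrame": one dict per row, mirroring the columns in the answer.
rows = [
    {"cat_column": "cat1", "dog_column": "dog1", "pet": "dog"},
    {"cat_column": "cat2", "dog_column": "dog2", "pet": "cat"},
]

def update_col(rows, val):
    # Analogue of withColumn + when/otherwise: overwrite newCol only on rows
    # whose pet matches val; other rows keep whatever newCol they already have.
    return [
        {**r, "newCol": r[val + "_column"] if r["pet"] == val else r.get("newCol")}
        for r in rows
    ]

dfvalues = ["dog", "cat"]
result = reduce(update_col, dfvalues, rows)
for r in result:
    print(r["pet"], r["newCol"])
```

Each call of update_col returns a new list of rows, which reduce feeds into the next call, exactly as each withColumn call returns a new DataFrame.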

Upvotes: 1

pault

Reputation: 43544

One way could be to use a list comprehension in conjunction with coalesce, very similar to the answer here.

mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))

This works because when() returns null for non-matching rows if there is no otherwise(), and coalesce() will pick the first non-null column.
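The coalesce idea can also be sketched without Spark: build one candidate value per pet type (None when the condition fails) and take the first non-None value per row. This is only an illustration of the pattern; the row dicts and the helper name first_non_null are made up for the example:

```python
dfvalues = ["dog", "cat", "parrot", "goldfish"]

rows = [
    {"cat_Column": "cat1", "dog_Column": "dog1", "pet": "dog"},
    {"cat_Column": "cat2", "dog_Column": "dog2", "pet": "cat"},
]

def first_non_null(values):
    # Analogue of F.coalesce(): return the first value that is not None.
    return next((v for v in values if v is not None), None)

for row in rows:
    # One candidate per pet type, mimicking the F.when(...) list comprehension:
    # None unless the condition matches (a when() with no otherwise()).
    candidates = [row.get(p + "_Column") if row["pet"] == p else None
                  for p in dfvalues]
    row["newCol"] = first_non_null(candidates)
```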

Upvotes: 8
