Reputation: 53
I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string, the column name changes. An example of the code I currently have is below:
df = df.withColumn("newCol",
    F.when(df.pet == "dog", df.dog_Column)
     .otherwise(F.when(df.pet == "cat", df.cat_Column)
                 .otherwise(None)))
I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:
dfvalues = ["dog", "cat", "parrot", "goldfish"]
df = df.withColumn("newCol", F.when(df.pet == dfvalues[0],
                                    F.col(dfvalues[0] + "_Column")))
The issue is that I cannot figure out how to build this chain of conditions in a loop in PySpark.
Upvotes: 5
Views: 4074
Reputation: 1497
I faced the same problem and found this site link. You can use Python's reduce to do the looping for a clean solution.
from functools import reduce
from pyspark.sql import functions as F

dfvalues = ["dog", "cat"]

def update_col(df1, val):
    return df1.withColumn('newCol',
        F.when(F.col('pet') == val, F.col(val + '_column'))
         .otherwise(F.col('newCol')))

# seed newCol with nulls so non-matching pets stay null
df1 = df.withColumn('newCol', F.lit(None).cast('string'))
reduce(update_col, dfvalues, df1).show()
that yields:
+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
| cat1| dog1|dog| dog1|
| cat2| dog2|cat| cat2|
+----------+----------+---+------+
Upvotes: 1
Reputation: 43544
One way could be to use a list comprehension in conjunction with a coalesce, very similar to the answer here.
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))
This works because when() will return None if there is no otherwise(), and coalesce() will pick the first non-null column.
Upvotes: 8