gillygangles
gillygangles

Reputation: 23

Update multiple columns based on the same list in PySpark dataframes

I am trying to update 3 columns based on text in a fourth column. Each updated column will have a different text.

For example, if the FreeText column has a value that falls into the category of a column, I want to change the column value to "1", the EditedCol column to the name of the column edited, and Match to "1".

Here is the code I am currently using. The parenthesis may be off because I wrote this up quickly, but that is not part of the issue. Is there a way to make it more efficient?

myList1="horse"
myList2="cheese"
myList3="soccer"

df=df.withColumn("Animal", F.when(F.col("FreeText).rlike(myList1), "Yes").otherwise(F.col("Animal"))
    ).withColumn("EditedCol", F.when(F.col("FreeText).rlike(myList1), "Animal").otherwise(F.col("EditedCol"))
    ).withColumn("Match", F.when(F.col("FreeText).rlike(myList1), "Yes").otherwise(F.col("Match")))

df=df.withColumn("Food", F.when(F.col("FreeText).rlike(myList2), "Yes").otherwise(F.col("Food"))
    ).withColumn("EditedCol", F.when(F.col("FreeText).rlike(myList2), "Food").otherwise(F.col("EditedCol"))
    ).withColumn("Match", F.when(F.col("FreeText).rlike(myList2), "Yes").otherwise(F.col("Match")))

df=df.withColumn("Sport", F.when(F.col("FreeText).rlike(myList3), "Yes").otherwise(F.col("Sport"))
    ).withColumn("EditedCol", F.when(F.col("FreeText).rlike(myList3), "Sport").otherwise(F.col("EditedCol"))
    ).withColumn("Match", F.when(F.col("FreeText).rlike(myList3), "Yes").otherwise(F.col("Match")))

Edit: Thank you, the following code worked:

myList = ["horse|donkey|monkey", "cheese|mango"]
myCols = ["Animals", "Food"]

for element, col in zip(myList, myCols):
    df = df.withColumn(col, F.when(F.col("FreeText").rlike(element), "Yes"
        ).otherwise(F.col(col))
    ).withColumn("EditedCol", F.when(F.col("FreeText").rlike(element), col
        ).otherwise(F.col("EditedCol"))
    ).withColumn("Match", F.when(F.col("FreeText").rlike(element), "Yes"
        ).otherwise(F.col("Match")))

Now for the "EditedCol" variable, I am trying to concatenate the variable list if the keywords in "FreeText" match to more than 1 variable. For my code below, for some reason instead of listing the 2nd variable it just says "Yes". Please advise:

myList = ["horse|donkey|monkey", "cheese|mango"]
myCols = ["Animals", "Food"]

for element, col in zip(myList, myCols):
    df = df.withColumn(col, F.when(F.col("FreeText").rlike(element), "Yes"
        ).otherwise(F.col(col))
    ).withColumn("EditedCol", F.when((F.col("Overwrite_race_spec").isNull() & F.col("FreeText").rlike(element)), col
        ).otherwise(F.when(F.col("FreeText").rlike(element), F.concat(("EditedCol"), F.lit("; "), col)
        ).otherwise(F.col("EditedCol")))
    ).withColumn("Match", F.when(F.col("FreeText").rlike(element), "Yes"
        ).otherwise(F.col("Match")))

Upvotes: 1

Views: 3532

Answers (1)

Prateek Jain
Prateek Jain

Reputation: 607

If by efficiency you meant code efficiency, then following will be more automated.

myList = ["horse", "cheese", "soccer"]
myCols = ["Animal", "Food", "Match"]

for element, col in zip(myList, col):
    df = df.withColumn(col, F.when(F.col("FreeText).rlike(element), "Yes").otherwise(F.col(col))) \
            .withColumn("EditedCol", F.when(F.col("FreeText).rlike(element), col).otherwise(F.col("EditedCol"))) \
            .withColumn("Match", F.when(F.col("FreeText).rlike(element), "Yes").otherwise(F.col("Match")))

Upvotes: 1

Related Questions