Reputation: 23
I am trying to update 3 columns based on text in a fourth column. Each updated column will have a different text.
For example, if the FreeText column has a value that falls into the category of a column, I want to change the column value to "1", the EditedCol column to the name of the column edited, and Match to "1".
Here is the code I am currently using. The parenthesis may be off because I wrote this up quickly, but that is not part of the issue. Is there a way to make it more efficient?
myList1="horse"
myList2="cheese"
myList3="soccer"
df=df.withColumn("Animal", F.when(F.col("FreeText).rlike(myList1), "Yes").otherwise(F.col("Animal"))
).withColumn("EditedCol", F.when(F.col("FreeText).rlike(myList1), "Animal").otherwise(F.col("EditedCol"))
).withColumn("Match", F.when(F.col("FreeText).rlike(myList1), "Yes").otherwise(F.col("Match")))
df=df.withColumn("Food", F.when(F.col("FreeText).rlike(myList2), "Yes").otherwise(F.col("Food"))
).withColumn("EditedCol", F.when(F.col("FreeText).rlike(myList2), "Food").otherwise(F.col("EditedCol"))
).withColumn("Match", F.when(F.col("FreeText).rlike(myList2), "Yes").otherwise(F.col("Match")))
df=df.withColumn("Sport", F.when(F.col("FreeText).rlike(myList3), "Yes").otherwise(F.col("Sport"))
).withColumn("EditedCol", F.when(F.col("FreeText).rlike(myList3), "Sport").otherwise(F.col("EditedCol"))
).withColumn("Match", F.when(F.col("FreeText).rlike(myList3), "Yes").otherwise(F.col("Match")))
Edit: Thank you, the following code worked:
myList = ["horse|donkey|monkey", "cheese|mango"]
myCols = ["Animals", "Food"]
for element, col in zip(myList, myCols):
df = df.withColumn(col, F.when(F.col("FreeText").rlike(element), "Yes"
).otherwise(F.col(col))
).withColumn("EditedCol", F.when(F.col("FreeText").rlike(element), col
).otherwise(F.col("EditedCol"))
).withColumn("Match", F.when(F.col("FreeText").rlike(element), "Yes"
).otherwise(F.col("Match")))
Now for the "EditedCol" variable, I am trying to concatenate the variable list if the keywords in "FreeText" match to more than 1 variable. For my code below, for some reason instead of listing the 2nd variable it just says "Yes". Please advise:
myList = ["horse|donkey|monkey", "cheese|mango"]
myCols = ["Animals", "Food"]
for element, col in zip(myList, myCols):
df = df.withColumn(col, F.when(F.col("FreeText").rlike(element), "Yes"
).otherwise(F.col(col))
).withColumn("EditedCol", F.when((F.col("Overwrite_race_spec").isNull() & F.col("FreeText").rlike(element)), col
).otherwise(F.when(F.col("FreeText").rlike(element), F.concat(("EditedCol"), F.lit("; "), col)
).otherwise(F.col("EditedCol")))
).withColumn("Match", F.when(F.col("FreeText").rlike(element), "Yes"
).otherwise(F.col("Match")))
Upvotes: 1
Views: 3532
Reputation: 607
If by efficiency you meant code efficiency, then following will be more automated.
myList = ["horse", "cheese", "soccer"]
myCols = ["Animal", "Food", "Match"]
for element, col in zip(myList, col):
df = df.withColumn(col, F.when(F.col("FreeText).rlike(element), "Yes").otherwise(F.col(col))) \
.withColumn("EditedCol", F.when(F.col("FreeText).rlike(element), col).otherwise(F.col("EditedCol"))) \
.withColumn("Match", F.when(F.col("FreeText).rlike(element), "Yes").otherwise(F.col("Match")))
Upvotes: 1