Reputation: 89
I need to add new column to my DF based on combination of two different conditions, while one of them is the same and the other is different for each case:
if operation is 'sync' and message contains 'deleted' - then deleted
if operation is 'sync' and message contains 'updated' - then updated
if operation is 'sync' and message contains 'created' - then created
import pyspark.sql.functions as f
operation = "SYNC"
deleted = "was deleted: true"
created = "was created: true"
updated = "was updated: true"
df1 = df.withColumn(
"synced",
f.when(
f.col("operation").contains(operation) & f.col("message").contains(deleted),
f.lit("deleted"),
)
.when(
f.col("operation").contains(operation) & f.col("message").contains(created),
f.lit("created"),
)
.when(
f.col("operation").contains(operation) & f.col("message").contains(updated),
f.lit("updated"),
),
)
This code works fine, meaning it does the job. But my problem is that it runs too long: for DF with ~168 million rows it runs about 20 mins.
I search 'message' field in this DF for many times for different strings and it usually takes less than a minute to run, but for this specific case running tine is way to high.
What can I do differently in order to reduce running time for this script?
Upvotes: 0
Views: 192
Reputation: 15273
Editing my answer to match your comment
To figure out the bottle neck can be a bit time consuming. You have several transformation in your code, after each of them, add an action, like show or collect for example and measure the time. You will be able to find which transformation is adding the most. Of course, the final action will take more time than the first one because it is a sum of all your transformation but comparing each time after each transformation, you'll be able to figure out if there is a huge gap at one moment.
The other way of analysing is to read the explain plan.
You do all you transformations, you do df.explain()
and try to understand where is your bottle neck. You can also add it to your question, someone may see where the problem comes from.
Upvotes: 2