pyspark-developer

Reputation: 65

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?

I have this code snippet that I ran locally in standalone mode with only 100 records:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# sc: the SparkContext for this session
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Read the catalog table as a DynamicFrame, convert it to a DataFrame, and count
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
print(df.count())

The schema contains 89 columns, all of string type except for 5 columns of array-of-struct type. The data size is 3.1 MB.

Also, here is some info about the environment used to run the code:

The problem is that I can't figure out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't find out what the "Scan parquet" and "Exchange" tasks mean, as shown in this image: Stage 1 DAG Visualization

My question is: is there a more systematic way to understand what those tasks mean? As a beginner, I have relied heavily on the Spark UI, but it doesn't give much information about the tasks it has executed. I was able to find which task took the most time, but I have no idea why that is or how to resolve it systematically.

Upvotes: 1

Views: 1413

Answers (1)

Amir Maleki

Reputation: 419

The running time of Spark code is calculated from the cluster kick-off time, the DAG scheduler optimisation time, and the time spent running the stages. In your case, the issue could be caused by one of the following:

  • The number of parquet files. To test this easily, read the table and write it back as a single parquet file. You are calling a table, but behind the scenes Spark is reading the physical parquet files, so the number of files is something to consider (see the sketch after this list).
  • The number of partitions. The partition count should be proportionate to the computing resources you have. For example, in your case you have 2 cores and a small table, so it's more efficient to have just a few partitions instead of the default partition count, which is 200.
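
As a rough sketch of both points (assuming the same glue_context, db, and table as in the question; the S3 output path is a placeholder):

# Lower the shuffle partition count to match a 2-core environment
# (spark.sql.shuffle.partitions defaults to 200)
glue_context.spark_session.conf.set("spark.sql.shuffle.partitions", "4")

# Read the catalog table and write it back as a single parquet file
# to rule out small-file overhead
df = glue_context.create_dynamic_frame.from_catalog(
    database=db, table_name=table).toDF()
df.coalesce(1).write.mode("overwrite").parquet("s3://your-bucket/compacted/")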

To get more clarity on the Spark stages, use the explain function and read the resulting plan. Its output lets you see and compare the Analyzed Logical Plan, the Optimized Logical Plan, and the Physical Plan produced by the internal optimiser processes.
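
For example, on the DataFrame from the question:

# Passing True requests the extended output: the Parsed, Analyzed, and
# Optimized Logical Plans in addition to the Physical Plan
df.explain(True)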

To find a more detailed description of the explain function, please visit this LINK.

Upvotes: 2
