jimmone

Reputation: 466

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks. 1 DPU is reserved for the master and 1 executor is used by the driver. Since my development endpoint has 4 DPUs, I expect to have 5 executors and 20 tasks.
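To spell out the arithmetic behind that expectation (a rough sketch based on the figures quoted above):

dpus = 4
executors = (dpus - 1) * 2 - 1   # 1 DPU reserved for the master, 1 executor taken by the driver
tasks = executors * 4            # each executor can run 4 tasks
print(executors, tasks)          # 5 20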

The script I am developing loads 1 million rows over a JDBC connection. Then I coalesce the single one-million-row partition into 5 partitions and write it to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job again runs for 34 seconds. So if I have 20 tasks available, why does the script take the same amount of time to complete with more partitions?

Edit: I started executing the script as an actual job instead of on the development endpoint. I set the number of workers to 10 and the worker type to Standard. Looking at the metrics I can see that I have only 9 executors instead of 17, and only 1 executor is doing anything while the rest sit idle.

Code:

...

df = spark.read.format("jdbc").option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"]).option(
    "user", jdbc_config["user"]).option("password", jdbc_config["password"]).option("dbtable", query).option("fetchSize", 50000).load()

df.coalesce(17)

df.write.mode("overwrite").format("csv").option(
    "compression", "gzip").option("maxRecordsPerFile", 1000000).save(job_config["s3Path"])

...

Upvotes: 1

Views: 1986

Answers (1)

Eman

Reputation: 851

This is very likely a limitation of the number of connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the options here.

Since you are reading as a DataFrame, you can set the partition column, the lower and upper bounds, and the number of partitions. More can be found here.
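As a sketch, assuming your table has a reasonably evenly distributed numeric column (the "id" column and the bounds below are placeholders), the read could look like:

df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"])
      .option("dbtable", query)
      .option("partitionColumn", "id")   # placeholder: any numeric or date column
      .option("lowerBound", 1)           # assumed range of that column
      .option("upperBound", 1000000)
      .option("numPartitions", 17)       # number of parallel JDBC connections
      .option("fetchSize", 50000)
      .load())

Spark then issues one query per partition, so you get numPartitions concurrent connections to the database instead of a single one.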

To size your DPUs correctly, I would suggest enabling the Spark UI. It can help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.
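When you run it as a job, the Spark UI event logs can be turned on through the job parameters; a minimal sketch using boto3 (the job name and log path are placeholders):

import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-glue-job",   # placeholder job name
    Arguments={
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://my-bucket/spark-logs/",  # placeholder bucket
    },
)

The event logs written to that path can then be browsed with a Spark history server to inspect the DAG and task distribution.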

Upvotes: 2
