jimmone

Reputation: 466

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks. 1 DPU is reserved for the master and 1 executor is used by the driver. Since my development endpoint has 4 DPUs, I expect to have 5 executors and 20 tasks.
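To spell out the arithmetic behind that expectation (a rough sketch based on the figures quoted above):

dpus = 4
executors = (dpus - 1) * 2 - 1   # 1 DPU reserved for the master, 1 executor taken by the driver
tasks = executors * 4            # each executor can run 4 tasks
print(executors, tasks)          # 5 20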

The script I am developing loads 1 million rows over a JDBC connection. Then I coalesce the single one-million-row partition into 5 partitions and write it to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job again runs for 34 seconds. So if I have 20 tasks available, why does the script take the same amount of time to complete with more partitions?

Edit: I started executing the script as an actual job instead of on the development endpoint. I set the number of workers to 10 and the worker type to Standard. Looking at the metrics I can see that I have only 9 executors instead of 17, and only 1 executor is doing anything while the rest sit idle.

Code:

...

df = spark.read.format("jdbc").option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"]).option(
    "user", jdbc_config["user"]).option("password", jdbc_config["password"]).option("dbtable", query).option("fetchSize", 50000).load()

df.coalesce(17)

df.write.mode("overwrite").format("csv").option(
    "compression", "gzip").option("maxRecordsPerFile", 1000000).save(job_config["s3Path"])

...

Upvotes: 1

Views: 1986

Answers (1)

Eman

Reputation: 851

This is very likely a limitation of the number of connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the options here.

Since you are reading as a DataFrame, you can set the partition column, the lower and upper bounds, and the number of partitions. More can be found here.
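As a sketch, assuming your table has a reasonably evenly distributed numeric column (the "id" column and the bounds below are placeholders), the read could look like:

df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"])
      .option("dbtable", query)
      .option("partitionColumn", "id")   # placeholder: any numeric or date column
      .option("lowerBound", 1)           # assumed range of that column
      .option("upperBound", 1000000)
      .option("numPartitions", 17)       # number of parallel JDBC connections
      .option("fetchSize", 50000)
      .load())

Spark then issues one query per partition, so you get numPartitions concurrent connections to the database instead of a single one.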

To size your DPUs correctly, I would suggest enabling the Spark UI. It can help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.
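When you run it as a job, the Spark UI event logs can be turned on through the job parameters; a minimal sketch using boto3 (the job name and log path are placeholders):

import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-glue-job",   # placeholder job name
    Arguments={
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://my-bucket/spark-logs/",  # placeholder bucket
    },
)

The event logs written to that path can then be browsed with a Spark history server to inspect the DAG and task distribution.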

Upvotes: 2
