PySpark extremely slow uploading to S3 running on Databricks

My ETL script reads three tables from a relational database, performs some operations through PySpark and uploads the result to my S3 bucket (via S3A).

Here's the code that makes the upload:

dataframe.write.mode("overwrite").partitionBy("dt").parquet(entity_path)

I have about 2 million rows, which are written to S3 as Parquet files partitioned by date ('dt').
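For context, the whole flow looks roughly like this (the JDBC URL, table name and bucket path below are placeholders, not my real values; spark is the session Databricks provides):

# rough sketch of the pipeline; jdbc_url, the table name and the bucket path are placeholders
dataframe = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "schema.entity")
    .load()
)
# ... some PySpark transformations ...
entity_path = "s3a://my-bucket/warehouse/entity"
dataframe.write.mode("overwrite").partitionBy("dt").parquet(entity_path)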

My script takes more than two hours to perform this upload to S3 (which is extremely slow), and it runs on Databricks on a cluster with:

 3-8 Workers: 366.0-976.0 GB Memory, 48-128 Cores, 12-32 DBU

I've concluded that the problem is in the upload, but I can't figure out what's going on.

Update: after adding repartition('dt'), the execution time was reduced to ~20 minutes. That helps, but I think it should run in even less time.
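The updated write looks like this (same entity_path as above):

dataframe.repartition('dt').write.mode("overwrite").partitionBy("dt").parquet(entity_path)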

Upvotes: 1

Views: 1736

Answers (2)

jon

Reputation: 405

More workers will help, because one worker (job) can only have one S3 connection.
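Something like this could also spread the upload over more parallel tasks, so several S3 connections are open at the same time (64 here is just an example, not a tuned value):

# spread the data over more tasks so several files upload to S3 at the same time
dataframe.repartition(64, "dt").write.mode("overwrite").partitionBy("dt").parquet(entity_path)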

Upvotes: 0

As I've noted in the update to the question, adding repartition('dt') reduced the execution time to ~13-20 minutes.

dataframe.repartition('dt').write.mode("overwrite").partitionBy("dt").parquet(entity_path)

After some analysis, I concluded that the cluster was processing the upload serially: the files were being uploaded to S3 one by one, in ascending order by date.

With the repartition, the cluster redistributes the data across its nodes and uploads the files in parallel, in no particular order, which made the upload much faster (from ~3 hours to ~20 minutes).
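If it helps, one way to see how many parallel write tasks (and therefore concurrent S3 uploads) you will get is to check the partition count right before the write:

# each partition becomes one write task, i.e. one concurrent S3 upload
print(dataframe.rdd.getNumPartitions())
print(dataframe.repartition('dt').rdd.getNumPartitions())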

This solution worked for me. If anyone knows a better approach or has anything to add, I'd be glad to hear it.

Upvotes: 2
