yaoviametepe

Reputation: 63

Cannot scale Spark job execution time

I built an ETL pipeline to process terabytes of data. To that end, I set up a Spark cluster (Scala) and a MinIO server for object storage. Using 10 virtual machines for Spark processing, I can process and save 200 gigabytes in roughly 30 minutes. The issue is that I cannot scale: if I double the number of Spark virtual machines, processing time does not improve.

I need some guidance to identify the bottleneck.

ARCHITECTURE SUMMARY

SOME DETAILS ABOUT DATA PROCESSING

The process is straightforward.

MY SPARK CONFIGURATION

Spark is deployed in client mode.

  mode = "client"
  network.timeout = 1800001
  rpc.askTimeout = 1800000
  default.parallelism = 320
  sql.shuffle.partitions = 320
  sql.adaptive.coalescePartitions.enabled = true
  sql.adaptive.enabled = true
  sql.files.maxRecordsPerFile = "100000"            // applies to spark.write
  sql.files.maxPartitionBytes = "31457280"          // applies to spark.read
  sql.adaptive.advisoryPartitionSizeInBytes = "30m"

  sql.objectHashAggregate.sortBased.fallbackThreshold = -1
  memory.fraction = 0.85

  // executor configs
  executor.cores = 4
  executor.memory = 12g
  executor.memoryOverhead = 3g
  total.executor.cores = 160
  executor.instances = 40
  executor.heartbeatInterval = 1800000
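
For reference, this is roughly how I apply these settings when building the session. A minimal sketch, assuming MinIO is accessed over s3a; the endpoint URL and bucket details below are placeholders, not my real values:

  import org.apache.spark.sql.SparkSession

  // Sketch: applying the configuration above when building the session.
  // The MinIO endpoint is a placeholder, not the actual deployment value.
  val spark = SparkSession.builder()
    .appName("etl-pipeline")
    .config("spark.default.parallelism", 320)
    .config("spark.sql.shuffle.partitions", 320)
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "30m")
    .config("spark.sql.files.maxPartitionBytes", 31457280L)  // ~30 MB read splits
    .config("spark.sql.files.maxRecordsPerFile", 100000L)    // cap output file size
    .config("spark.executor.cores", 4)
    .config("spark.executor.memory", "12g")
    .config("spark.executor.memoryOverhead", "3g")
    .config("spark.executor.instances", 40)
    // s3a settings for MinIO -- endpoint is illustrative
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()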

Upvotes: 0

Views: 121

Answers (2)

aj-jester

Reputation: 36

Based on the infrastructure you mentioned, in order to scale the Spark jobs you also need to scale the data backend. This is similar to an app that writes to a traditional database: as the number of requests increases, adding front-end workers won't actually improve performance until the database is also scaled to match the additional load. You have to consider all bottlenecks in your infrastructure.

So in this case, if you are increasing your Spark VM count, you also need to scale the MinIO backend accordingly to see the performance increase.

Upvotes: 1

Harshavardhana

Reputation: 1428

You cannot just scale Spark to scale your performance; you first need to measure whether your MinIO deployment is at the scale you require.

From what I can gather, you have deployed MinIO on only a single node, and there is only so much a single storage node can do to serve you as you scale your clients.

Just scaling clients is not sufficient.
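
A quick way to confirm this: time a plain read against MinIO before and after doubling the Spark VMs. A minimal sketch, assuming the data is read over s3a from an existing `spark` session; the bucket path is a placeholder:

  // Rough throughput check: time a read-and-count with no transformations,
  // so the measurement isolates the storage backend from the ETL logic.
  // The bucket and path are illustrative.
  val start = System.nanoTime()
  val rowCount = spark.read.parquet("s3a://my-bucket/raw/").count()
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"Read $rowCount%d rows in $seconds%.1f s")
  // If this time stays flat as you add Spark VMs, the single MinIO node is saturated.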

Upvotes: 0
