Reputation: 63

Cannot scale spark job execution time

I built an ETL pipeline to process terabytes of data. To that end, I set up a Spark cluster (Scala) and a MinIO server for object storage. Using 10 virtual machines for Spark processing, I can process and save 200 gigabytes in roughly 30 minutes. My problem is that the job does not scale: if I double the number of Spark virtual machines, the processing time does not change.

I need some guidance to identify the bottleneck.

ARCHITECTURE SUMMARY.

I use virtual machines set up on-premises using VMware ESXi 6. The physical machines (which host the VMs) are on a 1 GB network.
There is no overcommitment of vCPU or RAM.
Spark VMs: 16 vCPU, 64 GB RAM
MinIO (storage): 16 vCPU, 64 GB RAM, configured using RAID0

SOME DETAILS ABOUT DATA PROCESSING

The process is straightforward:

Read data from 2 sources on MinIO.
Union the data from the two sources.
Filter out empty values on a column from the resulting dataset.
Apply 2 groupBy operations on that column (I save intermediate values after the first groupBy).
Union the dataset obtained after the groupBy operations with the rows that had empty values in that column.
Save the whole result back to MinIO.

MY SPARK CONFIGURATION

Spark is deployed in client mode.

  mode = "client"
  network.timeout = 1800001 
  rpc.askTimeout = 1800000
  default.parallelism = 320 
  sql.shuffle.partitions = 320 #
  spark.sql.adaptive.coalescePartitions.enabled=true
  spark.sql.adaptive.enabled=true
  sql.files.maxRecordsPerFile = "100000" // spark.write
  sql.files.maxPartitionBytes = "31457280" // spark.read
  sql.adaptive.advisoryPartitionSizeInBytes = "30m"

  sql.objectHashAggregate.sortBased.fallbackThreshold = -1
  memory.fraction = 0.85

  # executor configs
  executor.cores = 4 
  executor.memory = 12g 
  executor.memoryOverhead = 3g
  total.executor.cores = 160
  executor.instances = 40
  executor.heartbeatInterval = 1800000
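
As a sanity check on the numbers above, the sketch below compares what the configuration requests with what the VMs provide. It assumes that `executor.instances` and `total.executor.cores` act as hard caps (as `spark.executor.instances` and `spark.cores.max` would) and that they were left unchanged when the VM count was doubled; the question states neither. The helper `cluster_summary` is hypothetical and not part of Spark.

```python
# Values copied from the configuration and architecture summary above.
EXECUTOR_CORES = 4
EXECUTOR_MEMORY_GB = 12
EXECUTOR_OVERHEAD_GB = 3
EXECUTOR_INSTANCES = 40
TOTAL_EXECUTOR_CORES = 160
VCPUS_PER_VM = 16
RAM_PER_VM_GB = 64


def cluster_summary(vm_count: int) -> dict:
    """Compare requested executor resources with what vm_count VMs offer.

    Assumption: the two executor limits are hard caps that stay fixed
    when VMs are added. Hypothetical helper, not a Spark API.
    """
    cluster_cores = vm_count * VCPUS_PER_VM
    # An executor needs its cores and its heap plus overhead on one VM.
    per_vm_by_cpu = VCPUS_PER_VM // EXECUTOR_CORES
    per_vm_by_ram = RAM_PER_VM_GB // (EXECUTOR_MEMORY_GB + EXECUTOR_OVERHEAD_GB)
    executors_that_fit = vm_count * min(per_vm_by_cpu, per_vm_by_ram)
    executors_used = min(executors_that_fit, EXECUTOR_INSTANCES)
    cores_used = min(executors_used * EXECUTOR_CORES, TOTAL_EXECUTOR_CORES)
    return {
        "cluster_cores": cluster_cores,
        "executors_used": executors_used,
        "cores_used": cores_used,
        "idle_cores": cluster_cores - cores_used,
    }


if __name__ == "__main__":
    for vms in (10, 20):
        print(vms, cluster_summary(vms))
```

With these inputs, 10 VMs are fully used (40 executors, 160 cores), while 20 VMs leave 160 cores idle. If the assumption holds, the executor limits have to be raised together with the VM count before any speed-up can appear.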

Upvotes: 0

Answers (2)

aj-jester

Reputation: 36

Based on the infrastructure you described, to scale the Spark jobs you also need to scale the data backend. It is similar to an app that writes to a traditional database: if the number of requests increases, adding front-end workers won't improve performance until the database is also scaled to handle the additional requests. You have to consider all the bottlenecks in your infrastructure.

So in this case, if you increase your Spark VM count, you also need to scale the MinIO backend accordingly to get the performance increase.
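
To make the analogy concrete, here is a minimal sketch of a two-stage bottleneck model. The function `estimated_minutes` and every rate in it are made-up illustrative numbers, not measurements from your cluster. It also assumes that compute and storage I/O overlap perfectly, so the slower of the two sets the job time.

```python
def estimated_minutes(
    data_gb: float,
    workers: int,
    worker_rate_gb_per_min: float,
    storage_rate_gb_per_min: float,
) -> float:
    """Job time when compute and storage run in parallel (hypothetical model)."""
    # Compute capacity grows with the number of Spark workers.
    compute_minutes = data_gb / (workers * worker_rate_gb_per_min)
    # Storage capacity is fixed by the backend, whatever the worker count.
    storage_minutes = data_gb / storage_rate_gb_per_min
    return max(compute_minutes, storage_minutes)


if __name__ == "__main__":
    # Illustrative rates only: 1 GB/min per worker, 8 GB/min for storage.
    print(estimated_minutes(200, 10, 1.0, 8.0))   # 25.0, storage-bound
    print(estimated_minutes(200, 20, 1.0, 8.0))   # 25.0, more workers: no gain
    print(estimated_minutes(200, 20, 1.0, 16.0))  # 12.5, storage scaled too
```

In this toy model, doubling the workers changes nothing while storage is the slower stage; the time only drops once the storage rate goes up as well.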

Upvotes: 1

Harshavardhana

Reputation: 1428

You cannot just scale Spark to scale your performance; you first need to measure whether your MinIO deployment is sized for the scale you require.

From what I can gather, you have deployed MinIO on only a single node. There is only so much a single storage node can do to serve requests as you scale your clients.

Just scaling clients is not sufficient.
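
One rough way to start measuring is to compare the throughput the job already achieves with what a single link can carry. The sketch below assumes that GB means 10^9 bytes, that the "1 GB network" in the question is a 1 Gbit/s link, and that the 200 GB crosses the MinIO node's network interface once. None of this is confirmed by the question, and `required_gbit_per_s` is a hypothetical helper.

```python
def required_gbit_per_s(data_gb: float, minutes: float) -> float:
    """Average network rate needed to move data_gb in the given time."""
    bits = data_gb * 1e9 * 8      # assumes 1 GB = 10**9 bytes
    seconds = minutes * 60
    return bits / seconds / 1e9   # Gbit/s


ASSUMED_LINK_GBIT_PER_S = 1.0     # assumption: "1 GB network" = 1 Gbit/s

if __name__ == "__main__":
    rate = required_gbit_per_s(200, 30)
    print(f"average rate: {rate:.2f} Gbit/s")
    print(f"share of assumed link: {rate / ASSUMED_LINK_GBIT_PER_S:.0%}")
```

Under those assumptions the job already averages about 0.89 Gbit/s, close to the capacity of one such link, which would be consistent with extra Spark VMs making no difference. Checking the actual network and disk utilisation on the MinIO node during a run would confirm or rule this out.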

Upvotes: 0
