Reputation: 63
I built an ETL pipeline to process terabytes of data. To do that, I set up a Spark cluster (Scala) and a MinIO server for object storage. With 10 virtual machines dedicated to Spark, I can process and save 200 gigabytes in roughly 30 minutes. The problem is that this does not scale: if I double the number of Spark virtual machines, the processing time does not change.
I need some guidance to identify the bottleneck.
ARCHITECTURE SUMMARY.
SOME DETAILS ABOUT DATA PROCESSING
The process is straightforward.
MY SPARK CONFIGURATION
Spark is deployed in client mode.
mode = "client"
network.timeout = 1800001
rpc.askTimeout = 1800000
default.parallelism = 320
sql.shuffle.partitions = 320
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.enabled=true
sql.files.maxRecordsPerFile = "100000" // spark.write
sql.files.maxPartitionBytes = "31457280" // spark.read
sql.adaptive.advisoryPartitionSizeInBytes = "30m"
sql.objectHashAggregate.sortBased.fallbackThreshold = -1
memory.fraction = 0.85
# executor configs
executor.cores = 4
executor.memory = 12g
executor.memoryOverhead = 3g
total.executor.cores = 160
executor.instances = 40
executor.heartbeatInterval = 1800000
Upvotes: 0
Views: 121
Reputation: 36
Based on the infrastructure you describe, scaling the Spark jobs also requires scaling the data backend. It is similar to an app that writes to a traditional database: if the number of requests increases, adding front-end workers won't improve performance until the database is also scaled to handle the additional requests. You have to consider every bottleneck in your infrastructure.
So in this case, if you increase your Spark VM count, you also need to scale the MinIO backend accordingly to see a performance gain.
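To make the argument concrete, here is a rough back-of-the-envelope sketch in Python using only the figures from the question (200 GB in about 30 minutes on 10 VMs). The toy model and the assumption that the storage ceiling equals the currently observed rate are mine, not measurements of your cluster, and the real storage traffic is likely higher because the job both reads and writes.

```python
# Back-of-the-envelope estimate built from the numbers in the question.
DATA_BYTES = 200 * 10**9   # 200 GB per run (decimal gigabytes assumed)
ELAPSED_S = 30 * 60        # "roughly 30 minutes"
SPARK_VMS = 10


def aggregate_mb_per_s(data_bytes, elapsed_s):
    """Average end-to-end throughput in MB/s."""
    return data_bytes / elapsed_s / 10**6


def job_seconds(data_bytes, vms, per_vm_mb_s, storage_cap_mb_s):
    """Toy model: the job runs at the slower of the combined Spark rate
    and the storage ceiling."""
    effective_mb_s = min(vms * per_vm_mb_s, storage_cap_mb_s)
    return data_bytes / 10**6 / effective_mb_s


observed = aggregate_mb_per_s(DATA_BYTES, ELAPSED_S)
per_vm = observed / SPARK_VMS

# Hypothetical scenario: storage is already saturated at the observed rate.
t10 = job_seconds(DATA_BYTES, 10, per_vm, observed)
t20 = job_seconds(DATA_BYTES, 20, per_vm, observed)
# Same 20 VMs, but with the storage ceiling doubled as well.
t20_scaled = job_seconds(DATA_BYTES, 20, per_vm, 2 * observed)

print(f"observed throughput: {observed:.1f} MB/s ({observed * 8 / 1000:.2f} Gbit/s)")
print(f"10 VMs: {t10 / 60:.0f} min, 20 VMs: {t20 / 60:.0f} min, "
      f"20 VMs + 2x storage: {t20_scaled / 60:.0f} min")
```

Under that assumption, doubling the VMs leaves the run time unchanged, which matches what you see. The observed rate works out to a bit under 1 Gbit/s, so it may be worth checking the network link and disk throughput of the MinIO host, although that match could be a coincidence.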
Upvotes: 1
Reputation: 1428
You cannot scale performance just by scaling Spark; you first need to measure whether your MinIO deployment is sized for the load you require.
From what I can gather, you have deployed MinIO on only a single node, and there is only so much a single storage node can do to serve a growing number of clients.
Scaling the clients alone is not sufficient.
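One way to do that measurement is to read the same set of objects at increasing client concurrency and see where throughput stops growing. The Python sketch below shows only the timing harness; `fake_read` is a hypothetical stand-in that I made up so the example runs without a server, and you would replace it with a real object GET against your MinIO endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(read_fn, keys, workers):
    """Read every key using `workers` threads and return the rate in MB/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(lambda key: len(read_fn(key)), keys))
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 10**6


def fake_read(key):
    """Hypothetical stand-in for an object GET: 10 ms latency, 1 MB payload."""
    time.sleep(0.01)
    return b"x" * 1_000_000


keys = [f"obj-{i}" for i in range(16)]  # hypothetical object names
results = {w: measure_throughput(fake_read, keys, w) for w in (1, 4, 16)}

for workers, mb_s in results.items():
    print(f"{workers:>2} concurrent readers: {mb_s:.0f} MB/s")
```

With the fake reader, throughput keeps rising with concurrency because nothing is shared. Against a real single-node store, the point where the curve flattens is a reasonable estimate of what that node can serve, no matter how many Spark executors sit in front of it.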
Upvotes: 0