Imran

Reputation: 5642

Spark Jobs on Yarn | Performance Tuning & Optimization

What is the best way to optimize Spark jobs deployed on a YARN-based cluster?

I'm looking for configuration-level changes, not code-level ones. This is classically a design-level question: what approach should be used to optimize jobs developed with Spark Streaming or Spark SQL?

Upvotes: 0

Views: 1018

Answers (2)

Rohit Karlupia

Reputation: 166

Assuming that the application works, i.e. memory configuration is taken care of and we have at least one successful run of the application, I usually look for underutilisation of executors and try to minimise it. Here are the common questions worth asking to find opportunities for improving utilisation of the cluster/executors:

  1. How much of the work is done in the driver vs the executors? Note that while the main Spark application thread is busy in the driver, the executors are killing time.
  2. Does your application have more tasks per stage than the number of cores? If not, some cores will sit idle during that stage. (See the sketch after this list.)
  3. Are your tasks uniform, i.e. not skewed? Since Spark moves computation from stage to stage (except for some stages that can run in parallel), it is possible for most of your tasks to complete while the stage is still running, because one skewed task is still held up.
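
A minimal sketch of the point-2 check in Scala (the dataset path, app name, and the 2-tasks-per-core target are illustrative assumptions, not a prescription):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallelism-check").getOrCreate()

    // Hypothetical input; substitute your own dataset.
    val df = spark.read.parquet("hdfs:///data/events")

    // Cores available to the application (the defaults here are illustrative).
    val totalCores =
      spark.conf.get("spark.executor.instances", "2").toInt *
      spark.conf.get("spark.executor.cores", "1").toInt

    // Point 2: with fewer partitions (tasks) than cores, some cores idle.
    val balanced =
      if (df.rdd.getNumPartitions < totalCores) df.repartition(totalCores * 2) // ~2 tasks per core
      else df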

Shameless plug (I'm the author): Sparklens https://github.com/qubole/sparklens can answer these questions for you, automatically.
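
Sparklens hooks in as a Spark listener; a minimal sketch, assuming the Sparklens jar is already on the driver classpath:

    import org.apache.spark.sql.SparkSession

    // Attach Sparklens so it can profile this run and report
    // driver vs executor time and per-stage utilisation.
    val spark = SparkSession.builder()
      .appName("sparklens-profiled-job")
      .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
      .getOrCreate()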

Some of these things are not specific to the application itself. If your application has to shuffle lots of data, pick machines with better disks and network. Partition your data to avoid full data scans. Use columnar formats like Parquet or ORC to avoid fetching data for columns you don't need. The list is pretty long, and some problems are known but don't have good solutions yet.
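
As a sketch of the partitioning and columnar-format advice (the paths and column names are made-up examples):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("columnar-layout").getOrCreate()

    // Write partitioned Parquet so later reads can prune by date
    // instead of scanning the full dataset.
    spark.read.json("hdfs:///raw/events")
      .write
      .partitionBy("event_date")
      .parquet("hdfs:///warehouse/events")

    // A columnar read fetches only the referenced columns, and the
    // partition filter skips irrelevant directories entirely.
    spark.read.parquet("hdfs:///warehouse/events")
      .filter("event_date = '2018-06-01'")
      .select("user_id", "event_type")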

Upvotes: 1

Imran

Reputation: 5642

There is a myth that Big Data is magic and your code will work like a dream once deployed to a Big Data cluster.

Every newbie has the same belief :) There is also a misconception that configurations given in web blogs will work fine for every problem.

There is no shortcut to optimizing or tuning jobs on Hadoop without deeply understanding your cluster.

But with the approach below, I'm confident you'll be able to optimize your job within a couple of hours.

I prefer to apply a purely scientific approach to optimizing jobs. The following steps can be followed to establish a baseline for optimization:

  1. Understand the block size configured on the cluster.
  2. Check the maximum memory limit available per container/executor.
  3. Understand the vCores available on the cluster.
  4. Optimize the data rate, specifically for Spark Streaming real-time jobs. (This is the trickiest part in Spark Streaming.)
  5. Consider GC settings while optimizing.
  6. There is always room for optimization at the code level; that needs to be considered as well.
  7. Control the block size optimally, based on the cluster configuration from step 1 and the data rate. In Spark Streaming the number of blocks per batch can be calculated as batchInterval / blockInterval (see the sketch after this list).
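
A sketch of what steps 2-7 look like as configuration, with every value purely illustrative (derive your own from your cluster limits and measured data rate):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("tuned-streaming-job")
      .set("spark.executor.memory", "4g")                        // step 2: stay under the YARN container limit
      .set("spark.executor.cores", "4")                          // step 3: vCores per executor
      .set("spark.streaming.backpressure.enabled", "true")       // step 4: let Spark adapt the ingest rate
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // step 4: cap records/sec per Kafka partition
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")    // step 5: GC choice
      .set("spark.streaming.blockInterval", "500ms")             // step 7: applies to receiver-based streams

    // Step 7: a 10s batch with a 500ms block interval yields
    // 10000 / 500 = 20 blocks, i.e. 20 map tasks per receiver.
    val ssc = new StreamingContext(conf, Seconds(10))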

Now come the most important steps. The knowledge I'm sharing is more specific to real-time use cases like Spark Streaming and Spark SQL with Kafka.

First of all, you need to know at what number of messages/records your job works best. Then you can control the rate to that particular number and start configuration-based experiments to optimize the job, as I've done below to resolve a performance issue while keeping throughput high.
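
For example, to pin the ingest rate for repeatable experiment runs (a sketch; the rate is whatever number your baseline runs showed the job handles best):

    import org.apache.spark.SparkConf

    // Fix the ingest rate so each experiment run processes a known,
    // repeatable number of records per batch.
    val experimentConf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "false")      // disable adaptation for repeatability
      .set("spark.streaming.kafka.maxRatePerPartition", "5000")  // rate found in baseline runs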

[Image: Performance Optimization Experiments - a grid of five configuration versions and their measured results]

I read some of the parameters from the Spark configuration docs and checked their impact on my jobs, then I made the above grid and ran the same job with five different configuration versions. Within three experiments I was able to optimize my job. The row highlighted in green in the picture above is the magic formula for my job's optimization.

The same parameters might be very helpful for similar use cases, but obviously they don't cover everything.

Upvotes: 2
