Srinivas Shekar

Reputation: 49

Apache Spark's performance tuning

I am working on a project in which I have to tune Spark's performance. I have found the four parameters that seem most important for this. They are as follows:

  1. spark.memory.fraction
  2. spark.memory.offHeap.size
  3. spark.storage.memoryFraction
  4. spark.shuffle.memoryFraction
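
For context, these can be set on the SparkConf like this (the values are placeholders I picked for experimentation, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Placeholder values for experimentation only.
    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.6")           // unified execution + storage share of heap
      .set("spark.memory.offHeap.enabled", "true")   // required for offHeap.size to take effect
      .set("spark.memory.offHeap.size", "2g")
      // The two settings below are legacy and are only read when
      // spark.memory.useLegacyMode is set to true:
      .set("spark.storage.memoryFraction", "0.5")
      .set("spark.shuffle.memoryFraction", "0.2")

    val spark = SparkSession.builder().config(conf).getOrCreate()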

I wanted to know whether I am going in the right direction. Please also let me know if I have missed any other important parameters.

Thanks in advance.

Upvotes: 4

Views: 1350

Answers (2)

Rohit Karlupia

Reputation: 166

We can divide the problem into two parts.

  1. Make it run
  2. Optimize for cost or time

In general, depending on whether the memory in question is Spark execution memory or user memory, Spark will either spill to disk or fail with an OOM. Memory tuning should also include the total size of the executor memory.
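As a hedged sketch of that last point, the per-executor totals are usually the first thing to check; the values here are purely illustrative:

    // Illustrative values only. The memory a cluster manager reserves per executor
    // is roughly the JVM heap plus the overhead (plus any off-heap allocation).
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "8g")          // executor JVM heap
      .set("spark.executor.memoryOverhead", "1g")  // non-heap overhead (name used since Spark 2.3)
      .set("spark.memory.fraction", "0.6")         // share of heap for execution + storage
    // The container request on YARN/Kubernetes is then roughly 8g + 1g = 9g per executor.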

For the second part, optimising for cost, time, compute, etc., try Sparklens https://github.com/qubole/sparklens (shameless plug: I am the author). Most of the time the real question is not whether the application is slow, but whether it will scale, or whether it is even using the resources it has been given. For most applications, the answer is: only up to a limit.
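A minimal sketch of how Sparklens is typically attached to a job; the package coordinates below come from the project README and may have changed, so check the repo for the current version:

    // Sparklens hooks in as a Spark listener. The jar must be on the driver
    // classpath, e.g. via spark-submit's --packages qubole:sparklens:0.3.2-s_2.11
    // (coordinates may be outdated; see the README).
    val conf = new org.apache.spark.SparkConf()
      .set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")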

The structure of spark application puts important constraints on its scalability. Number of tasks in a stage, dependencies between stages, skew and amount of work done on the driver side are the main constraints.
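A quick way to see the first constraint in practice: the number of tasks in a stage equals the number of partitions, which caps how many cores can be doing useful work at once. The input path below is hypothetical.

    // Hypothetical input path; illustrative sketch only.
    val df = spark.read.parquet("/data/events")
    println(s"partitions = ${df.rdd.getNumPartitions}")

    // If this is far below the total executor cores, extra executors sit idle.
    // Repartitioning raises the ceiling, at the cost of a shuffle:
    val widened = df.repartition(200)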

One of the best features of Sparklens is that it simulates how your Spark application will perform with different executor counts and tells you the expected cluster utilisation at each count. This helps you make the right trade-off between time and efficiency.

Upvotes: 1

eliasah

Reputation: 40360

Honestly, this is quite broad to answer. The right path to optimizing performance is mainly described in the official documentation, in the section on Tuning Spark.

Generally speaking, there are lots of factors involved in optimizing Spark jobs:

  • Data Serialization
  • Memory Tuning
  • Level of Parallelism
  • Memory Usage of Reduce Tasks
  • Broadcasting Large Variables
  • Data Locality

It mainly centers on data serialization, memory tuning, and a trade-off between precision and approximation techniques to get the job done fast.
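
To make the serialization item concrete, here is a minimal sketch of switching to Kryo and registering classes up front (the Record class is a made-up example):

    import org.apache.spark.SparkConf

    // Made-up example class; register the classes your job actually
    // shuffles or caches.
    case class Record(id: Long, payload: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Record]))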

EDIT:

Courtesy of @zero323:

I'd point out that all but one of the options mentioned in the question are deprecated and used only in legacy mode.

Upvotes: 2
