Reputation: 49
I am working on a project in which I have to tune Spark's performance. I have found four parameters that seem most important for tuning Spark's performance. They are as follows:
I wanted to know whether I am going in the right direction. Please also let me know if I have missed any other parameters.
Thanks in advance.
Upvotes: 4
Views: 1350
Reputation: 166
We can divide the problem into two parts.
In general, depending on whether the memory in question is Spark execution memory or user memory, Spark will either spill to disk or OOM. I think the memory tuning part will also include the total size of the executor memory.
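As a hedged sketch of that memory split (unified memory management in Spark 1.6+; the values below are illustrative, not recommendations, and would normally be supplied via spark-submit or cluster defaults rather than in code):

```scala
import org.apache.spark.sql.SparkSession

// With spark.memory.fraction = 0.6 (the default), roughly 60% of
// (executor heap - ~300MB reserved) is shared by execution and storage;
// the remainder is "user" memory for your own data structures. Execution
// memory can spill to disk under pressure, while oversized user-side
// structures tend to OOM instead.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  .master("local[*]")                             // placeholder; on a real cluster the master comes from spark-submit
  .config("spark.executor.memory", "4g")          // total heap per executor
  .config("spark.memory.fraction", "0.6")         // execution + storage share of usable heap
  .config("spark.memory.storageFraction", "0.5")  // storage share, evictable by execution
  .getOrCreate()
```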
For the second question (how to optimise for cost, time, compute, etc.), try Sparklens https://github.com/qubole/sparklens (shameless plug: I am the author). Most of the time the real question is not whether the application is slow, but whether it will scale, or whether it is even using the resources it has been given. For most applications, the answer is: only up to a limit.
The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, the dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
One of the best features of Sparklens is that it simulates and tells you how your Spark application will perform with different executor counts, and what the expected cluster utilisation level is at each executor count. This helps you make the right trade-off between time and efficiency.
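For reference, a minimal way to attach Sparklens to a job, following its README; the package version below is an assumption, so check the repository for the current coordinates:

```scala
// Typically attached at submit time, e.g.:
//   spark-submit --packages qubole:sparklens:0.3.2-s_2.11 \
//     --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
//     ... your-app.jar
//
// The same listener can also be wired in programmatically, provided the
// Sparklens jar is already on the classpath:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sparklens-sketch")
  .master("local[*]")  // placeholder; normally supplied by spark-submit
  .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .getOrCreate()
```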
Upvotes: 1
Reputation: 40360
Honestly, this is quite broad to answer. The right path to optimizing performance is mainly described in the official documentation, in the section on Tuning Spark.
Generally speaking, there are lots of factors involved in optimizing Spark jobs:
It is mainly centered around data serialization, memory tuning, and a trade-off between precision and approximation techniques to get the job done fast.
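For instance, switching to Kryo serialization, one of the levers the Tuning Spark guide describes, looks roughly like this; `MyRecord` is a placeholder type for illustration, not something from the question:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder record type standing in for your own classes.
case class MyRecord(id: Long, value: Double)

// Kryo is generally faster and more compact than Java serialization for
// shuffled or cached data; registering classes avoids writing full class
// names alongside every serialized object.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))

val spark = SparkSession.builder()
  .appName("serialization-sketch")
  .master("local[*]")  // placeholder; normally supplied by spark-submit
  .config(conf)
  .getOrCreate()
```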
EDIT:
Courtesy of @zero323:
I'd point out that all but one of the options mentioned in the question are deprecated and used only in legacy mode.
Upvotes: 2