siyu

Reputation: 3

How to improve the utilization of Spark jobs on YARN or Kubernetes

I am working on improving the utilization of a cluster. The cluster currently runs on YARN and will move to Kubernetes.

My question is: how do I improve the utilization ratio? How should I think about this problem, and are there established methods, both for YARN and for Kubernetes?

For YARN, I have read some articles and watched some videos. YARN has NodeManagers (NM) and a ResourceManager (RM).

  1. Oversubscription based on historical job running data. (https://databricks.com/session/oversubscribing-apache-spark-resource-usage-for-fun-and)
    a. set appropriate memory (e.g. 5 GB) and CPU for the job (see the sketch after this list)
    b. set a buffer for the job (e.g. 1 GB)
    c. do preemption actively on the NM

  2. Oversubscription based on real-time utilization. (https://research.facebook.com/publications/ubis-utilization-aware-cluster-scheduling/)
    a. do not modify job settings
    b. monitor the utilization and allocation of the NM, and oversubscribe the node accordingly
    c. do preemption actively on the NM

  3. Oversubscription of NM resources
    a. the NM physically has 100 GB and 30 cores, but announces 120 GB and 40 cores
    b. preemption is handled by Spark or the YARN framework
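
For approach 1, a minimal sketch of what "right-sizing" a job might look like, assuming Spark 2.3+ (where `spark.executor.memoryOverhead` is the overhead setting) and using the 5 GB / 1 GB figures from the list above; the core count is an illustrative assumption, and in practice the values would come from the job's historical usage (the same settings can also be passed as `--conf` flags to `spark-submit`):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("right-sized-job")
    # Executor heap sized from the job's historical peak usage (5 GB here).
    .config("spark.executor.memory", "5g")
    # Buffer for off-heap / overhead memory on top of the heap (1 GB here).
    .config("spark.executor.memoryOverhead", "1g")
    # Keep the vcore request explicit as well (2 cores is an assumption).
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```
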

Upvotes: 0

Views: 543

Answers (1)

Matt Andruff

Reputation: 5125

I have had a lot of success with oversubscription. Classically, users overestimate their requirements.

A cool tool that LinkedIn released is Dr. Elephant. It aims to help users tune their own jobs, educating them and giving them the tools to stop over-requesting resources. https://github.com/linkedin/dr-elephant. It seems to have been quiet for a couple of years, but it might be worthwhile to look at the code and see what they examined, to help you make some educated judgements about oversubscription.

I don't have anything to do with PepperData, but their tuning does use oversubscription to optimize the cluster, so it's definitely a recognized pattern. If you want a service provider to help you with optimizing, they might be a good team to talk to.

I would suggest that you just use a classic performance-tuning strategy. Record your existing weekly metrics. Understand what's going on in your cluster. Make a change: bump everything by 10% and see if you get a boost in performance. Understand what's going on in your cluster. If it works and is stable, do it again the following week. Do that until you see an issue or stop seeing improvement. It takes time and careful recording of what's happening, but it's likely the only way to tune your cluster, as your cluster is likely special.
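
As a rough sketch of that weekly loop, here is what the 10% bump could look like when applied to the resources a NodeManager announces (the `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores` settings from the question). The physical capacities and the 10% step are assumptions taken from the numbers above; the point is only to compute the next values to try, then watch the metrics for a week before bumping again:

```python
PHYSICAL_MEMORY_MB = 100 * 1024   # what the node really has (assumed, per the question)
PHYSICAL_VCORES = 30
STEP = 0.10                       # 10% oversubscription per iteration


def next_announcement(memory_mb: int, vcores: int, step: float = STEP) -> dict:
    """Return the yarn-site.xml values to announce for the next iteration."""
    return {
        "yarn.nodemanager.resource.memory-mb": int(memory_mb * (1 + step)),
        "yarn.nodemanager.resource.cpu-vcores": int(vcores * (1 + step)),
    }


# Week 1: start from the physical capacity.
week1 = next_announcement(PHYSICAL_MEMORY_MB, PHYSICAL_VCORES)
print(week1)  # {'yarn.nodemanager.resource.memory-mb': 112640, 'yarn.nodemanager.resource.cpu-vcores': 33}

# Week 2: if the cluster stayed healthy, bump again from the new values.
week2 = next_announcement(
    week1["yarn.nodemanager.resource.memory-mb"],
    week1["yarn.nodemanager.resource.cpu-vcores"],
)
print(week2)  # {'yarn.nodemanager.resource.memory-mb': 123904, 'yarn.nodemanager.resource.cpu-vcores': 36}
```

Stop iterating as soon as preemption or node instability shows up in the recorded metrics, and roll back to the last stable announcement.
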

Upvotes: 1
