VB_

Reputation: 45692

Spark: launch jobs with different memory/cores configs simultaneously from a single JVM

Problem explanation

Suppose you have a Spark cluster with the Standalone manager, where jobs are scheduled through a SparkSession created in a client app. The client app runs on a single JVM, and you have to launch each job with different configs for the sake of performance (see the job types example below).

The problem is that you can't create two sessions from a single JVM.

So how do you launch multiple Spark jobs with different session configs simultaneously?

By different session configs I mean different executor memory/cores settings per job - see the job types example below.

My thoughts

Possible ways to solve the problem:

  1. Set different session configs for each Spark job within the same SparkSession. Is it possible?
  2. Launch another JVM just to start another SparkSession - something I could call a Spark session service (see the sketch after this list). But you never know how many jobs with different configs you will need to run simultaneously in the future. At the moment I need only 2-3 different configs at a time. That may be enough, but it's not flexible.
  3. Make a global session with the same configs for all kinds of jobs. But this approach is the worst from a performance perspective.
  4. Use Spark only for heavy jobs, and run all quick search tasks outside Spark. But that's a mess, since you need to keep another solution (like Hazelcast) running in parallel with Spark and split resources between them. Moreover, it brings extra complexity everywhere: deployment, support, etc.
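For option 2, here is a minimal sketch of what such a "Spark session service" could look like using Spark's SparkLauncher API, which spawns a separate driver JVM per job so each one gets its own session and resource config. The jar path, master URL and class names are placeholders for my actual artifacts:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    object JobLauncher {
      // Placeholders - replace with the real job jar and master URL.
      private val JobJar    = "/opt/jobs/my-spark-jobs.jar"
      private val MasterUrl = "spark://master:7077"

      def launch(mainClass: String,
                 executorMemory: String,
                 executorCores: String,
                 maxCores: String): SparkAppHandle =
        new SparkLauncher()
          .setAppResource(JobJar)
          .setMainClass(mainClass)
          .setMaster(MasterUrl)
          .setConf(SparkLauncher.EXECUTOR_MEMORY, executorMemory)
          .setConf(SparkLauncher.EXECUTOR_CORES, executorCores)
          .setConf("spark.cores.max", maxCores) // cap this app's total cores on the cluster
          .startApplication()                   // non-blocking, returns a handle to monitor the app
    }

    // Two jobs with different resource profiles running at the same time,
    // each in its own driver JVM (and therefore its own SparkSession):
    val dump   = JobLauncher.launch("com.example.DumpJob",   "1g", "1", "32")
    val crunch = JobLauncher.launch("com.example.CrunchJob", "8g", "8", "16")

The downside is the one described above: every concurrent config costs an extra JVM.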

Job types example

  1. Dump-huge-database task. It's a long-running, CPU-light but IO-intensive task, so you may want to launch as many executors as you can, with low memory and few cores per executor.
  2. Heavy handle-dump-results task. It's CPU intensive, so you would launch one executor per cluster machine, with maximum cores and memory.
  3. Quick data-retrieval task, which requires one executor per machine and minimal resources.
  4. Something in the middle between 1-2 and 3, where a job should take half of the cluster resources.
  5. etc.
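To make these profiles concrete, here is roughly what I mean in terms of per-job settings (the numbers are illustrative, assuming something like 16-core / 64 GB standalone workers):

    import org.apache.spark.SparkConf

    // 1. IO-bound dump: many small executors.
    val dumpConf = new SparkConf()
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "1")

    // 2. CPU-bound handling of the dump results: one fat executor per machine.
    val crunchConf = new SparkConf()
      .set("spark.executor.memory", "48g")
      .set("spark.executor.cores", "16")

    // 3. Quick data retrieval: minimal footprint, capped total cores.
    val lookupConf = new SparkConf()
      .set("spark.executor.memory", "1g")
      .set("spark.executor.cores", "1")
      .set("spark.cores.max", "4")

The conflict is that all of these would have to come from the same JVM at the same time.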

Upvotes: 1

Views: 753

Answers (1)

FaigB

Reputation: 2281

Spark standalone uses a simple FIFO scheduler for applications. By default, each application uses all the available nodes in the cluster. The number of nodes can be limited per application, per user, or globally. Other resources, such as memory, CPUs, etc., can be controlled via the application's SparkConf object.
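For example, on standalone a single application's share is typically capped like this (the master URL is a placeholder):

    import org.apache.spark.sql.SparkSession

    // spark.cores.max limits this application's total cores across the cluster;
    // spark.deploy.defaultCores on the master sets the default for apps that don't set it.
    val spark = SparkSession.builder()
      .master("spark://master:7077")
      .appName("capped-app")
      .config("spark.cores.max", "8")
      .config("spark.executor.memory", "4g")
      .getOrCreate()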

Apache Mesos has master and slave processes. The master makes offers of resources to the application (called a framework in Apache Mesos), which either accepts the offer or not. Thus, claiming available resources and running jobs is determined by the application itself. Apache Mesos allows fine-grained control of the resources in a system, such as CPUs, memory, disks, and ports. Apache Mesos also offers coarse-grained control of resources, where Spark allocates a fixed number of CPUs to each executor in advance, which are not released until the application exits. Note that in the same cluster, some applications can be set to use fine-grained control while others are set to use coarse-grained control.
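As a sketch, the mode is chosen per application via spark.mesos.coarse (the master URL is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("mesos://mesos-master:5050")
      .appName("mesos-app")
      // true  -> coarse-grained: a fixed set of cores is held for the app's lifetime
      // false -> fine-grained: cores are claimed per task and released afterwards
      .config("spark.mesos.coarse", "false")
      .getOrCreate()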

Apache Hadoop YARN has a ResourceManager with two parts, a Scheduler and an ApplicationsManager. The Scheduler is a pluggable component. Two implementations are provided: the CapacityScheduler, useful in a cluster shared by more than one organization, and the FairScheduler, which ensures all applications, on average, get an equal share of resources. Both schedulers assign applications to queues, and each queue gets resources that are shared equally between them. Within a queue, resources are shared between the applications. The ApplicationsManager is responsible for accepting job submissions and starting the application-specific ApplicationMaster. In this case, the ApplicationMaster is the Spark application. In the Spark application, resources are specified in the application's SparkConf object.
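For instance, the target queue and per-executor resources are chosen per application (the queue name here is a placeholder; the queues themselves are defined on the ResourceManager):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("yarn")
      .appName("queued-app")
      .config("spark.yarn.queue", "analytics")   // CapacityScheduler/FairScheduler queue
      .config("spark.executor.memory", "4g")
      .config("spark.executor.cores", "2")
      .getOrCreate()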

For your case it is simply not possible with standalone alone. Maybe there are some workaround solutions out there, but I haven't come across any.

Upvotes: 1
