Guforu

Reputation: 4023

Spark Standalone

I have Ubuntu 14.04 with 4 CPUs on my machine (nproc returns 4). After installing and running Spark Standalone (locally), I can define the number of slaves myself. For example, I want to have 4 slaves (workers). After starting that number of slaves, I saw the following Spark Standalone screen:

[Screenshot of the Spark Standalone master web UI showing 4 workers, 16 cores in total, and 11 GB of memory]

How is it possible that I have a total of 16 cores (orange field) and 11 GB of memory, if a single worker already has 4 cores (I think 1 core is 1 CPU)? And what is the advantage of having 4 slaves instead of one? Probably none if I run it locally (it will also be slower), but if I have a Hadoop cluster, how should the cores be shared, and how can I improve the speed of program execution? One additional question: if I start several applications (Scala, Python or Java), the first one is RUNNING and the other 2 or 3 are in WAITING mode. Is it possible to run all applications in parallel with each other?

Upvotes: 0

Views: 616

Answers (1)

Bacon

Reputation: 1844

You are misunderstanding several things here:

Standalone

This does not mean "local". Standalone mode is the cluster manager built into Spark, which can be replaced by YARN or Mesos. You can use as many nodes as you want. You can also run purely locally, on a given number X of threads, for example by running the ./bin/spark-shell --master local[X] command.
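
Inside an application you choose the master the same way. A minimal Scala sketch (the app name is a placeholder; "spark://master-host:7077" would be an assumed standalone master URL):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[4]" runs everything in one JVM on 4 threads;
    // "spark://master-host:7077" would submit to a standalone cluster instead.
    val conf = new SparkConf()
      .setAppName("example-app")   // placeholder name
      .setMaster("local[4]")
    val sc = new SparkContext(conf)

    println(sc.parallelize(1 to 100).sum())   // quick sanity check
    sc.stop()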

Cores/memory

Those numbers reflect the total amount of resources in your cluster, rounded up. Here, if we do the math: 4 workers * 4 cores = 16 cores, and 4 workers * 2.7 GB ≈ 11 GB of memory.
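
Since all 4 worker instances run on the same 4-CPU machine, each of them advertises all local cores and most of the local RAM by default, so the totals over-count the physical resources. A sketch of how you could cap this in conf/spark-env.sh (the values below are assumptions for a single 4-CPU host, not your actual configuration):

    # conf/spark-env.sh -- assumed values for one 4-CPU host
    SPARK_WORKER_INSTANCES=4   # number of worker processes on this host
    SPARK_WORKER_CORES=1       # cores each worker is allowed to hand out
    SPARK_WORKER_MEMORY=700m   # memory each worker is allowed to hand out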

Resource management

If I have a Hadoop cluster, how should the cores be shared?

A Hadoop cluster is different from a Spark cluster. There are several ways to combine the two, but most of the time the part of Hadoop you'll use alongside Spark is HDFS, the distributed filesystem.

Depending on the cluster manager you're using with Spark, the cores will be managed differently (a configuration sketch follows the list):

  • YARN uses NodeManagers on the cluster nodes to launch containers, inside which Spark's executors run (one executor = one JVM)

  • Spark Standalone uses its workers as a gateway to launch the executors

  • Mesos launches executors directly
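
Whichever cluster manager you use, the resources a single application asks for are set in that application's configuration. A minimal Scala sketch, assuming a standalone master at spark://master-host:7077 and purely illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("resource-demo")             // placeholder name
      .setMaster("spark://master-host:7077")   // assumed standalone master URL
      .set("spark.executor.memory", "2g")      // heap per executor JVM
      .set("spark.executor.cores", "2")        // cores per executor
      .set("spark.cores.max", "8")             // total cores this app may take
    val sc = new SparkContext(conf)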

Scheduling

Hadoop and Spark use a technique known as delay scheduling, which basically relies on the principle that an application can refuse an offer from a worker for one of its tasks, in the hope of later receiving a better offer in terms of data locality.
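
In Spark this trade-off is exposed through the locality-wait setting; a hedged Scala sketch (the 10-second value is only an illustration, the default is 3s):

    import org.apache.spark.SparkConf

    // How long the scheduler waits for a data-local slot before falling back
    // to a less local one. "0" disables delay scheduling entirely.
    val conf = new SparkConf()
      .setAppName("locality-demo")        // placeholder name
      .set("spark.locality.wait", "10s")  // default is 3s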

How can I improve the speed of program execution?

This is a complex question that cannot be answered without knowing your infrastructure, input data, and application. Here are some of the parameters that will affect your performance (a short example follows the list):

  • Amount of memory available (mainly for caching RDDs that are used often)
  • Use of compression for your data/RDDs
  • Application configuration
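
For example, caching a frequently reused RDD and enabling RDD compression might look like this (a sketch; the input path and storage level are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("cache-demo")            // placeholder name
      .setMaster("local[4]")               // run locally for this sketch
      .set("spark.rdd.compress", "true")   // compress serialized RDD partitions
    val sc = new SparkContext(conf)

    // Keep the RDD in memory in serialized form, spilling to disk if it does not fit.
    val lines = sc.textFile("input.txt")   // assumed path
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(lines.count())   // first action materializes the cache
    println(lines.count())   // later actions read from the cache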

Is it possible to run all applications in parallel with each other?

By default, the Standalone master uses a FIFO scheduler for its applications, but you can set up the fair scheduler inside an application. For more details, see the scheduling documentation.
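
Two settings matter here, sketched in Scala with illustrative values: spark.cores.max caps how many cores one application may take from the standalone cluster (without a cap, the first application grabs all 16 advertised cores and the others stay in WAITING), while spark.scheduler.mode=FAIR controls fair scheduling of jobs inside a single application:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallel-apps-demo")        // placeholder name
      .setMaster("spark://master-host:7077")   // assumed standalone master URL
      .set("spark.cores.max", "4")             // leave cores free for other apps
      .set("spark.scheduler.mode", "FAIR")     // fair scheduling of jobs within this app
    val sc = new SparkContext(conf)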

Upvotes: 3
