Reputation: 4023
I have Ubuntu 14.04 with 4 CPUs on my machine (nproc returns 4).
After installing and starting Spark Standalone (locally), I can define the number of slaves (workers) myself. For example, I want to have 4 slaves (workers). After starting that number of slaves, I got the following Spark Standalone screen:
How is it possible that I have a total number of cores of 16 (orange field) and 11 GB of memory, if a single worker already has 4 cores (I think 1 core is 1 CPU)? And what is the advantage of having 4 slaves instead of one? Probably, if I run it locally, there is none (it will even be slower), but if I have a Hadoop cluster, how should the cores be shared, and how can I improve the speed of program execution? One additional question: if I start several applications (Scala, Python or Java), the first one is RUNNING while the other 2 or 3 are in WAITING mode. Is it possible to run all applications in parallel?
Upvotes: 0
Views: 616
Reputation: 1844
You are misunderstanding several things here:
Standalone
This does not mean "local". Standalone mode is the cluster manager built into Spark, which can be replaced by YARN or Mesos. You can use as many nodes as you want. You can also run purely locally, on a given number X of threads, by running, for example, the ./bin/spark-shell --master local[X] command.
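As a minimal Scala sketch of the difference (the master URL spark://master-host:7077 is just a placeholder for your own cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local mode: everything runs inside a single JVM, on 4 threads.
val localConf = new SparkConf()
  .setAppName("local-example")
  .setMaster("local[4]")

// Standalone mode: the driver connects to a standalone master,
// which schedules executors on the cluster's workers.
// "master-host" is a placeholder for your actual master.
val clusterConf = new SparkConf()
  .setAppName("cluster-example")
  .setMaster("spark://master-host:7077")

val sc = new SparkContext(localConf) // or clusterConf
```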
Cores/memory
Those numbers reflect the total amount of resources in your cluster, rounded up. Here, if we do the math, you have 4 workers * 4 CPUs = 16 cores, and 4 workers * 2.7 GB ≈ 11 GB of memory.
Resource management
If I have a hadoop cluster, how the cores should be shared
A Hadoop cluster is different from a Spark cluster. There are several ways to combine the two, but most of the time the part of Hadoop you'll be using in combination with Spark is HDFS, the distributed filesystem.
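As a rough illustration (the NameNode host, port and path below are hypothetical), a Spark application typically just points at an HDFS path and lets HDFS take care of distributing the data:

```scala
// Assumes an existing SparkContext `sc`; host, port and path are placeholders.
val lines = sc.textFile("hdfs://namenode-host:8020/data/input.txt")
val wordCount = lines.flatMap(_.split("\\s+")).count()
```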
Depending on the cluster manager you're using with Spark, the cores will be managed differently:
YARN uses node managers on the nodes to launch containers in which you can launch Spark's executors (one executor = one JVM)
Spark Standalone uses workers as a gateway to launch the executors
Mesos launches executors directly
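Whichever cluster manager you use, the resources given to each executor come from the application's configuration. A hedged sketch (the values are arbitrary examples, and spark.executor.cores is honoured by YARN and only by newer standalone releases):

```scala
import org.apache.spark.SparkConf

// Example values only; tune them to your cluster.
val conf = new SparkConf()
  .set("spark.executor.memory", "2g") // heap size of each executor JVM
  .set("spark.executor.cores", "2")   // cores per executor (YARN, newer standalone releases)
```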
Scheduling
Hadoop and Spark use a technique known as delay scheduling, which basically relies on the principle that an application can decide to refuse an offer from a worker to place one of its tasks, in the hope that it will later receive a better offer in terms of data locality.
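In Spark this behaviour can be tuned through the spark.locality.wait property; a minimal sketch, where the value shown is simply the default:

```scala
import org.apache.spark.SparkConf

// How long Spark waits for a data-local slot before falling back
// to a less local one; "3s" is the default value.
val conf = new SparkConf().set("spark.locality.wait", "3s")
```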
How can I improve the speed of program execution?
This is a complex question that cannot be answered without knowledge of your infrastructure, input data, and application, since many different parameters will affect your performance.
Is it possible to run all applications in parallel?
By default, the Standalone master uses a FIFO scheduler for its applications, but you can set up the Fair Scheduler inside an application. For more details, see the scheduling documentation.
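As a hedged sketch: in standalone mode an application grabs all available cores by default, so capping spark.cores.max is what leaves room for other applications to run at the same time, while spark.scheduler.mode = FAIR only affects the scheduling of jobs inside a single application (values below are arbitrary examples):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallel-friendly-app")
  .set("spark.cores.max", "4")         // leave cores free for other applications
  .set("spark.scheduler.mode", "FAIR") // fair scheduling of jobs within this application

val sc = new SparkContext(conf)
```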
Upvotes: 3