IttayD

Reputation: 29123

Spark and Map-Reduce together

What is the best approach to running Spark on a cluster that also runs MapReduce jobs?

My first question is about data locality. When I start a Spark application, it allocates executors, right? How does it know where to allocate them so that they are on the same nodes as the data the jobs will need? (One job may want one piece of data while the next job may need another.)

If I keep the Spark application up, the executors take slots on the machines in the cluster. Does that mean that, for data locality, I need to have a Spark executor on every node?

With the executors running, there are fewer resources left for my MapReduce jobs, right? I could stop and start the Spark application for every job, but then I lose the speed advantage of having the executors already up and running, correct? (And also the HotSpot benefits of long-running processes?)

I have read that container resizing (YARN-1197) will help, but doesn't that just mean that executors will stop and start? Isn't that the same as stopping the Spark application? (In other words, if there are no live executors, what is the benefit of keeping the Spark application up versus shutting it down and starting it when a job requires executors?)

Upvotes: 1

Views: 887

Answers (1)

Pankaj Arora

Reputation: 544

  1. Data locality of executors: Spark does not deal with data locality when launching executors, but when launching tasks on them. So you might need an executor on each data node (HDFS replication can help you even if you don't have executors on every node).

  2. Long-running process: Whether to shut down your application depends on the use case. If you want to serve real-time requests or run Spark Streaming, you will not want to shut Spark down. But if you are doing batch processing, you should shut down your executors. For caching data across jobs, consider either HDFS caching or Tachyon. You can also consider Spark's dynamic allocation, which frees executors that have been idle for some time (http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation); see the sketch after this list.

  3. YARN-1197 will help by letting containers release some of the CPUs/memory allocated to them. I am not sure, though, whether Spark supports this yet.
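As a minimal sketch of points 1 and 2 (not part of the original answer): the Scala snippet below assumes Spark on YARN with the external shuffle service available; the property values, app name, and HDFS path are placeholders, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-sketch")
      // Point 2: dynamic allocation requests executors as tasks queue up and
      // releases them after an idle timeout, instead of holding them for the
      // lifetime of the application.
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "0")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
      // Required on YARN so shuffle files outlive the executors that wrote them.
      .set("spark.shuffle.service.enabled", "true")
      // Point 1: how long the task scheduler waits for a node-local slot
      // before falling back to a less local one.
      .set("spark.locality.wait", "3s")

    val sc = new SparkContext(conf)
    // Placeholder path: tasks are scheduled near the HDFS blocks where possible.
    val lines = sc.textFile("hdfs:///path/to/input")
    println(lines.count())
    sc.stop()
  }
}
```

With settings like these, idle executors are handed back to YARN after the timeout, so MapReduce jobs can use those slots between Spark jobs without you shutting the whole Spark application down.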

Upvotes: 3
