Reputation: 9771
I'm about to try EMR and am going through the documentation right now. I'm a bit confused by the submit process.
1) Where are the Spark libraries?
From the Spark documentation we find:
- spark.yarn.jars: List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
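For illustration, a minimal sketch of that setup (the HDFS path is hypothetical; on EMR the jars typically live under /usr/lib/spark/jars):

# stage the Spark jars on HDFS once so YARN can cache them on the nodes
hdfs dfs -mkdir -p /user/spark/jars
hdfs dfs -put /usr/lib/spark/jars/*.jar /user/spark/jars/
# then point Spark at them (globs are allowed, per the docs above)
spark-submit --conf "spark.yarn.jars=hdfs:///user/spark/jars/*.jar" ...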
2) How does the --master parameter work?
From the Spark documentation we have:
- --master:Unlike other cluster managers supported by Spark in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
3) Is there a way to submit the application from my terminal, or is the only way to actually deploy the jar on S3? Can I log on to the master and do the submit from there? Will all the environment variables necessary for the submit script to work be ready (see previous question)? What is the most productive way to do this submit?
Upvotes: 4
Views: 6002
Reputation: 992
1) Where are the Spark libraries? spark is available in the path, meaning you can run spark-submit from the command line interface anywhere on the master node. However, if you want to tweak the config files of Spark, they are located under /etc/spark/conf/ on all nodes.
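A quick way to verify this on the master node (the paths shown are what EMR typically uses; adjust for your release):

# confirm spark-submit is on the PATH of the master node
which spark-submit        # typically resolves to /usr/bin/spark-submit
# and peek at the Spark config files
ls /etc/spark/conf/       # spark-defaults.conf, spark-env.sh, log4j.properties, ...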
2) How to submit a Spark application? There are two ways:
a) CLI on the master node: issue spark-submit with all the params, e.g.:
spark-submit --class com.some.core.Main --deploy-mode cluster --master yarn s3://path_to_some_jar.jar
b) AWS EMR web console: submitting a Spark application from the EMR web console means submitting an EMR step; an EMR step is basically a UI version of spark-submit, more info here.
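The same step can also be added from a terminal with the AWS CLI (the cluster id is a placeholder; the jar path mirrors the example above):

# CLI equivalent of adding a Spark step to a running cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=MyApp,ActionOnFailure=CONTINUE,Args=[--class,com.some.core.Main,--deploy-mode,cluster,s3://path_to_some_jar.jar]'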
3) How does the --master parameter work, is it set up by EMR directly? This is set automatically if you are using an AWS EMR step (i.e. the web console way); the UI will add it for you. But if you are using the CLI as in question 2a, then you need to specify it explicitly.
4) Is the only way to actually deploy the jar on S3? There are two (or more) ways:
a) Copy the jar to the master node directly (e.g. with scp) and submit from there, as in question 2a.
b) Upload the jar to s3 and reference it when submitting.
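A sketch of both options from a local terminal (key, host, bucket, and jar names are hypothetical; EMR's default SSH user is hadoop):

# option a: copy the jar to the master node, then ssh in and spark-submit from there
scp -i ~/my-key.pem target/my-app.jar hadoop@<master-public-dns>:~/
# option b: upload the jar to s3 and reference the s3:// path when submitting
aws s3 cp target/my-app.jar s3://my-bucket/jars/my-app.jar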
5) Will all the environment variables necessary for the submit script to work be ready? Yes, when you submit a spark application to EMR, it is a fully configured, ready-to-use spark cluster. If you need further setup, you can use a bootstrap action to execute a script, which can only be done during cluster creation, more info here (a CLI sketch follows question 6 below).
6) What is the most productive way to do this submit? This depends on the use case: if you can/want to manage the job yourself, simply do a spark-submit, but to get the advantages of AWS EMR's automatic debugging logs, an AWS EMR step is the way to go.
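As mentioned under question 5, a bootstrap action can only be attached at cluster creation; a minimal sketch with the AWS CLI (bucket, script name, release label, and instance settings are hypothetical):

# the script in s3 runs on every node while the cluster is being provisioned
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap.sh,Name=my-setup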
Update:
7) How to change configurations of yarn, spark, etc.? Again, there are two options:
a) Config files: they are located under /etc/hadoop/conf (and /etc/spark/conf, see question 1). Modify these on the master node; you probably have to restart the yarn resource manager on the master node.
b) AWS web console: you can submit a configuration on the web console as mentioned here when creating a cluster. For example, if you want to enable YARN FAIR scheduling, the config JSON to supply will look like:
{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
  }
}
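To apply it from a terminal instead, the same JSON (wrapped in a top-level list and saved locally, e.g. as configurations.json) can be passed at cluster creation; release label and instance settings below are hypothetical:

# configurations.json contains: [ { ...the object above... } ]
aws emr create-cluster \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --configurations file://configurations.json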
PS: I forgot to mention that almost anything you can do on the AWS web console, you can also do programmatically with the AWS CLI or the AWS SDK.
Upvotes: 5