Morariu

Reputation: 384

Submit spark jobs to AWS EMR with arguments via the aws-cli

I am looking to modify Spark jobs that are currently submitted to a Google Dataproc cluster so that they run on an AWS EMR cluster instead. Here is the existing Dataproc submission:

gcloud dataproc jobs submit spark \
--cluster "${HADOOP_CLUSTER_NAME}" \
--properties "${SPARK_PARTITIONS}${SPARK_PARALLELISM}spark.master=yarn,spark.app.name=${APP},spark.sql.parquet.mergeSchema=false,spark.driver.memory=${D_MEMORY},spark.ui.port=0,spark.dynamicAllocation.enabled=false,spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35,spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:./" \
--class com.custom.scriptrunner.MyCustomSparkScriptRunner \
--files $CONFIG,$TRUSTSTORE \
--jars $JARS \
-- -s $SCRIPT -c $CONFIG_FILE -r $CONFIG_ROOT -l myMetrics

I have tried the following with spark-submit directly on the master node, instead of the AWS EMR CLI:

spark-submit \
--deploy-mode cluster \
--class com.custom.scriptrunner.MyCustomSparkScriptRunner \
--files $CONFIG_FILE \
--jars $JARS \
--conf spark.app.name=${APP} \
--conf spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:./ \
-s $SCRIPT -c $CONFIG_FILE_NAME -r $CONFIG_ROOT -l myMetrics

But I can't find a way to pass the following application arguments (with either spark-submit or the AWS EMR CLI). Neither recognizes the options.

-- -s $SCRIPT -c $CONFIG_FILE -r $CONFIG_ROOT -l myMetrics
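
From what I can tell, spark-submit expects application arguments to come after the application JAR rather than behind a `--` separator (which is a gcloud dataproc convention). A sketch of what I believe the equivalent command would look like, where my-script-runner.jar is a placeholder for the actual application JAR:

spark-submit \
--deploy-mode cluster \
--class com.custom.scriptrunner.MyCustomSparkScriptRunner \
--files $CONFIG,$TRUSTSTORE \
--jars $JARS \
my-script-runner.jar \
-s $SCRIPT -c $CONFIG_FILE -r $CONFIG_ROOT -l myMetrics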

I also found this AWS CLI command, but I still can't find the syntax to specify the above arguments.

aws emr add-steps --cluster-id j-xxxxxxxx --steps Name="add emr step to run spark",Jar="command-runner.jar",Args=[spark-submit,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
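
If I understand the step syntax correctly, the application arguments would be appended as additional comma-separated tokens after the application JAR path inside the Args list. A sketch with placeholder values (the s3:// path and the argument values are hypothetical):

aws emr add-steps --cluster-id j-xxxxxxxx --steps Name="spark step with app args",Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--class,com.custom.scriptrunner.MyCustomSparkScriptRunner,s3://mybucket/my-script-runner.jar,-s,myScript,-c,myConfig,-r,myRoot,-l,myMetrics]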

Upvotes: 0

Views: 1818

Answers (1)

gbharat

Reputation: 276

You can pass arguments to spark-submit as mentioned here, and read each argument by its position in the application code.
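
A minimal sketch of the positional mapping, assuming the arguments are placed after the application JAR (the JAR name below is a placeholder):

# everything after the application JAR lands in the main method's argument array:
# args(0) = "-s", args(1) = "myScript", args(2) = "-c", args(3) = "myConfig", ...
spark-submit --class com.custom.scriptrunner.MyCustomSparkScriptRunner my-script-runner.jar -s myScript -c myConfig -r myRoot -l myMetrics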

Upvotes: -1
