Fisher Coder
Reputation: 3576

Using aws emr add-steps to add a Spark application

I'd like to add a Spark application as a step using the AWS CLI, but I cannot find a working command. The official AWS documentation (https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html) lists 6 examples, and none of them is for Spark. I can configure the step through the AWS Console UI and it runs fine, but for efficiency I'd like to be able to do the same via the AWS CLI.

The closest that I could come up with is this command:

aws emr add-steps --cluster-id j-cluster-id --steps  Type=SPARK,Name='SPARK APP',ActionOnFailure=CONTINUE,Jar=s3://my-test/RandomJava-1.0-SNAPSHOT.jar,MainClass=JavaParquetExample1,Args=s3://my-test/my-file_0000_part_00.parquet,my-test --profile my-test --region us-west-2

but this resulted in the following configuration on the AWS EMR step:

JAR location : command-runner.jar
Main class : None
Arguments : spark-submit s3://my-test/my-file_0000_part_00.parquet my-test
Action on failure: Continue

and the step failed.

The correct one (completed successfully, configured through AWS Console UI) looks like this:

JAR location : command-runner.jar
Main class : None
Arguments : spark-submit --deploy-mode cluster --class sparkExamples.JavaParquetExample1 s3://my-test/RandomJava-1.0-SNAPSHOT.jar --s3://my-test/my-file_0000_part_00.parquet --my-test
Action on failure: Continue

Any help is greatly appreciated!

Upvotes: 0

Views: 2028

Answers (1)

Ajay Kr Choudhary
Reputation: 1362

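This works for me. The script below adds a Spark application to a cluster as a step named My step name. With Type=Spark, everything in Args is appended to spark-submit (as the step configuration in your question shows), so the spark-submit options, --class, and the application JAR location all go inside Args. Let's say you save the script as step-addition.sh. Its content is the following: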

#!/bin/bash
set -x

# Positional arguments: cluster ID, then the start and end dates passed to the job
clusterId=$1
startDate=$2
endDate=$3

aws emr add-steps --cluster-id "$clusterId" --steps Type=Spark,Name='My step name',\
ActionOnFailure=TERMINATE_CLUSTER,Args=[\
"--deploy-mode","cluster","--executor-cores","1","--num-executors","20","--driver-memory","10g","--executor-memory","3g",\
"--class","your-package-structure-like-com.a.b.c.JavaParquetExample1",\
"--master","yarn",\
"--conf","spark.driver.my.custom.config1=my-value-1",\
"--conf","spark.driver.my.custom.config2=my-value-2",\
"--conf","spark.driver.my.custom.config.startDate=${startDate}",\
"--conf","spark.driver.my.custom.config.endDate=${endDate}",\
"s3://my-bucket/my-prefix/path-to-your-actual-application.jar"]

You can execute the above script simply like this:

bash $WORK_DIR/step-addition.sh $clusterId $startDate $endDate
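
Applied to the job from the question, a minimal sketch could look like the following. It rebuilds the argument list that the Console showed for the successful step (deploy mode, class, JAR, then the two application arguments) as a single --steps string. The helper name build_spark_step and the EMR_CLUSTER_ID environment variable are my own inventions for illustration, not part of the AWS CLI, and the sketch assumes that none of the values contains a comma, since commas separate the entries in Args.

```shell
#!/bin/bash
# Hypothetical helper: builds the --steps shorthand string for a Spark step.
# Usage: build_spark_step NAME MAIN_CLASS JAR [APP_ARG...]
# Assumes no value contains a comma (commas separate entries inside Args).
build_spark_step() {
  name=$1
  main_class=$2
  jar=$3
  shift 3
  # spark-submit options come first, then the application JAR.
  args="--deploy-mode,cluster,--class,${main_class},${jar}"
  # Anything after the JAR is passed to the application itself.
  for a in "$@"; do
    args="${args},${a}"
  done
  printf 'Type=Spark,Name=%s,ActionOnFailure=CONTINUE,Args=[%s]' "$name" "$args"
}

# Values copied from the question's successful Console configuration.
step=$(build_spark_step 'SPARK APP' sparkExamples.JavaParquetExample1 \
  s3://my-test/RandomJava-1.0-SNAPSHOT.jar \
  --s3://my-test/my-file_0000_part_00.parquet --my-test)
echo "$step"

# Submit only when EMR_CLUSTER_ID (a made-up variable name) is set,
# so the script can be dry-run to inspect the generated string first.
if [ -n "${EMR_CLUSTER_ID:-}" ]; then
  aws emr add-steps --cluster-id "$EMR_CLUSTER_ID" --steps "$step" \
    --profile my-test --region us-west-2
fi
```

Running it without EMR_CLUSTER_ID only prints the step definition, which makes it easy to compare against the Arguments line of a step that already works in the Console before submitting anything.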

Upvotes: 1