akash sharma

Reputation: 461

Facing an error while trying to create a transient cluster on AWS EMR to run a Python script

I am new to AWS and am trying to create a transient cluster on AWS EMR to run a Python script. I want the script to process a file and the cluster to auto-terminate once it completes. I have also created a key pair and specified it in the command.

Command below:

aws emr create-cluster --name "test1-cluster" --release-label emr-5.5.0 --name pyspark_analysis --ec2-attributes KeyName=k-key-pair --applications Name=Hadoop Name=Hive Name=Spark --instance-groups --use-default-roles --instance-type m5-xlarge --instance-count 2 --region us-east-1 --log-uri s3://k-test-bucket-input/logs/ --steps Type=SPARK, Name="pyspark_analysis", ActionOnFailure=CONTINUE, Args=[-deploy-mode,cluster, -master,yarn, -conf,spark.yarn.submit.waitAppCompletion=true, -executor-memory,1g, s3://k-test-bucket-input/word_count.py, s3://k-test-bucket-input/input/a.csv, s3://k-test-bucket-input/output/ ] --auto-terminate

Error message

zsh: bad pattern: Args=[

What I tried:

I checked the arguments for stray spaces and accidental characters, but none appear to be present. My syntax is surely wrong somewhere, but I am not sure what I am missing.

What the command is expected to do:

It is expected to execute word_count.py, reading the input file a.csv and generating the output in b.csv.

Upvotes: 0

Views: 724

Answers (2)

Shubham Jain

Reputation: 5526

Try enclosing everything in quotes:

aws emr create-cluster \
    --name "test1-cluster" \
    --release-label emr-5.5.0 \
    --name pyspark_analysis \
    --ec2-attributes KeyName=k-key-pair \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --use-default-roles \
    --instance-type m5.xlarge --instance-count 2 \
    --region us-east-1 --log-uri s3://k-test-bucket-input/logs/ \
    --steps 'Type=Spark,Name=pyspark_analysis,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/]' \
    --auto-terminate
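Why quoting helps: zsh treats unquoted square brackets as glob patterns, and the space after `Args=[` in the original command leaves an unclosed `[`, which zsh rejects as "bad pattern" before the aws binary is ever executed. Single-quoting makes the shell pass the value through literally. A minimal demonstration (a shortened, hypothetical step string, not the full command):

```shell
# Single quotes stop the shell from interpreting the brackets, so the
# string reaches the program exactly as written:
step_args='Args=[--deploy-mode,cluster,--master,yarn]'
printf '%s\n' "$step_args"   # prints Args=[--deploy-mode,cluster,--master,yarn]
```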

See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html for more info.

And yes, Spark can be used:

aws emr create-cluster --name "Add Spark Step Cluster" --release-label emr-5.30.1 --applications Name=Spark \
--ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 \
--steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10] --use-default-roles
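Another way to sidestep shell quoting entirely is the CLI's JSON input: `--steps` also accepts a `file://` path, so the step definition never touches the shell's pattern matching. A sketch under the question's bucket and script names (flag spelling assumes spark-submit's long options):

```shell
# Write the step definition as JSON; shell globbing never applies to
# file contents, so the brackets are safe here.
cat > steps.json <<'EOF'
[
  {
    "Type": "Spark",
    "Name": "pyspark_analysis",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--deploy-mode", "cluster",
      "--master", "yarn",
      "--conf", "spark.yarn.submit.waitAppCompletion=true",
      "--executor-memory", "1g",
      "s3://k-test-bucket-input/word_count.py",
      "s3://k-test-bucket-input/input/a.csv",
      "s3://k-test-bucket-input/output/"
    ]
  }
]
EOF
# Then reference the file from the create-cluster call:
#   aws emr create-cluster ... --steps file://steps.json --auto-terminate
```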

Upvotes: 1

Marcin

Reputation: 238209

I think the issue is the use of spaces in --steps. I have formatted the command so it is easier to see where the spaces are (or the lack of them):

aws emr create-cluster \
    --name "test1-cluster" \
    --release-label emr-5.5.0 \
    --name pyspark_analysis \
    --ec2-attributes KeyName=k-key-pair \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --instance-groups --use-default-roles \
    --instance-type m5-xlarge --instance-count 2 \
    --region us-east-1 --log-uri s3://k-test-bucket-input/logs/ \
    --steps Type=SPARK,Name="pyspark_analysis",ActionOnFailure=CONTINUE,Args=[-deploy-mode,cluster,-master,yarn,-conf,spark.yarn.submit.waitAppCompletion=true,-executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/] \
    --auto-terminate
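The point about spaces can be reproduced in any POSIX shell: each unquoted space starts a new word, so `Type=SPARK, Name=...` reaches the CLI as several separate arguments. A toy demonstration (not the real command):

```shell
# With a space after the comma, the step shorthand splits into two
# shell words, so the CLI would see a stray extra argument:
set -- Type=SPARK, Name=pyspark_analysis
echo "$#"   # prints 2
# Without the space it remains a single argument, as the CLI expects:
set -- Type=SPARK,Name=pyspark_analysis
echo "$#"   # prints 1
```

Note that in zsh, removing the spaces still leaves the `[...]` part subject to globbing, so quoting the whole `--steps` value (or prefixing the command with `noglob`) remains the safest fix.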

Upvotes: 2
