Reputation: 18995
I'm new to Spark and having trouble replicating the example in the EMR docs for submitting a basic user application with spark-submit
via AWS CLI. It seems to run without error but produces no output. Is something wrong with my syntax for add-steps in the below workflow?
The objective is to count words in a document in S3, for this example 1000 words of lorem-ipsum:
$ aws s3 cp s3://projects/wordcount/input/some_document.txt - | head -n1
Lorem ipsum dolor sit amet, consectetur adipiscing [... etc.]
Copying from the docs, the python script looks like:
$ aws s3 cp s3://projects/wordcount/wordcount.py -
from __future__ import print_function
from pyspark import SparkContext
import sys
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: wordcount ", file=sys.stderr)
exit(-1)
sc = SparkContext(appName="WordCount")
text_file = sc.textFile(sys.argv[1])
counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(sys.argv[2])
sc.stop()
The destination folder (currently empty):
$ aws s3 ls s3://projects/wordcount/output
PRE output/
Working from doc, I have a running cluster with logging, as produced by:
aws emr create-cluster --name TestSparkCluster \
--release-label emr-5.11.0 --applications Name=Spark \
--enable-debugging --log-uri s3://projects/wordcount/log \
--instance-type m3.xlarge --instance-count 3 --use-default-roles
with return message showing created {"ClusterID": "j-XXXXXXXXXXXXX"}
Looking directly at the example, I submit add-steps
as:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://projects/wordcount/wordcount.py,\
s3://projects/wordcount/input/some_document.txt,\
s3://projects/wordcount/output/],\
ActionOnFailure=CONTINUE
which launches { "StepIds":["s-YYYYYYYYYYY"] }
The output folder is empty -- why?
I verify that step SparkWordCountApp
with ID s-YYYYYYYYYYY
has Status:Completed
in the EMR console.
From the console, I check the controller log file and the stderr output (below) to verify that the step completed with exit status 0.
In the Spark documentation, a somewhat different syntax is used. Instead of making the script name the first position of the arguments list, it says:
For Python applications, simply pass a .py file in the place of instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
However, the example uses sys.argv
, where sys.argv[0] is wordcount.py
Controller log file:
2018-01-24T15:54:05.945Z INFO Ensure step 3 jar file command-runner.jar
2018-01-24T15:54:05.945Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=us-west-2
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
PYTHON_INSTALL_LAYOUT=amzn
HOSTNAME=ip-172-31-12-232
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-29XVS3IGMSK1
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1
INFO ProcessRunner started child process 20797 :
hadoop 20797 3347 0 15:54 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/
2018-01-24T15:54:09.956Z INFO HadoopJarStepRunner.Runner: startRun() called for s-29XVS3IGMSK1 Child Pid: 20797
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 16 seconds
2018-01-24T15:54:24.072Z INFO Step created jobs:
2018-01-24T15:54:24.072Z INFO Step succeeded with exitCode 0 and took 16 seconds
Stderr log file:
18/01/24 15:54:12 INFO RMProxy: Connecting to ResourceManager at ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal/XXX.YY.YY.ZZZ:8032
18/01/24 15:54:12 INFO Client: Requesting a new application from cluster with 2 NodeManagers
18/01/24 15:54:12 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
18/01/24 15:54:12 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/01/24 15:54:12 INFO Client: Setting up container launch context for our AM
18/01/24 15:54:12 INFO Client: Setting up the launch environment for our AM container
18/01/24 15:54:12 INFO Client: Preparing resources for our AM container
18/01/24 15:54:14 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/01/24 15:54:18 INFO Client: Uploading resource file:/mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7/__spark_libs__8429498492477236801.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/__spark_libs__8429498492477236801.zip
18/01/24 15:54:22 INFO Client: Uploading resource s3://projects/wordcount/wordcount.py -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/wordcount.py
18/01/24 15:54:22 INFO S3NativeFileSystem: Opening 's3://projects/wordcount/wordcount.py' for reading
18/01/24 15:54:22 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/pyspark.zip
18/01/24 15:54:22 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/py4j-0.10.4-src.zip
18/01/24 15:54:22 INFO Client: Uploading resource file:/mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7/__spark_conf__8267377904454396581.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/__spark_conf__.zip
18/01/24 15:54:22 INFO SecurityManager: Changing view acls to: hadoop
18/01/24 15:54:22 INFO SecurityManager: Changing modify acls to: hadoop
18/01/24 15:54:22 INFO SecurityManager: Changing view acls groups to:
18/01/24 15:54:22 INFO SecurityManager: Changing modify acls groups to:
18/01/24 15:54:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/01/24 15:54:22 INFO Client: Submitting application application_1516806627838_0002 to ResourceManager
18/01/24 15:54:23 INFO YarnClientImpl: Submitted application application_1516806627838_0002
18/01/24 15:54:23 INFO Client: Application report for application_1516806627838_0002 (state: ACCEPTED)
18/01/24 15:54:23 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1516809262990
final status: UNDEFINED
tracking URL: http://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:20888/proxy/application_1516806627838_0002/
user: hadoop
18/01/24 15:54:23 INFO ShutdownHookManager: Shutdown hook called
18/01/24 15:54:23 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7
Command exiting with ret '0'
Upvotes: 4
Views: 10429
Reputation: 21
The following add-step worked for me, ran it from the master node:
aws emr add-steps --cluster-id yourclusterid --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn, --conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,2,--executor-cores,2,--executor-memory,1g,s3://yourbucketname/yourcode.py,s3://yourbucketname/yourinputfile.txt,s3://yourbucketname/youroutputfile.out], ActionOnFailure=CONTINUE
Upvotes: 0
Reputation: 18995
It turns out that the problem was caused by the destination folder already existing (even though empty). Deleting the output folder makes the example work.
I figured this out by reading the step logs in S3 rather than the instance logs in the EMR console -- in those logs I saw an org.apache.hadoop.mapred.FileAlreadyExistsException
which clued me in.
A pre-existing S3 folder isn't a problem for other writing tasks I've done (e.g., PigStorage
) so I wasn't expecting this.
I'm going to leave this question up in the (unlikely) event anyone else runs into this.
Upvotes: 4