Reputation: 50488
I'm connected to the master node of a Spark cluster running inside of EMR, and am trying to submit a Python-based application:
spark-submit --verbose --deploy-mode cluster --master yarn-cluster --num-executors 3 --executor-cores 6 --executor-memory 1g test.py
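For reference, the contents of test.py are not shown here; a minimal PySpark script like the following (a hypothetical stand-in, not the actual application) would be enough to exercise the submission path:
# test.py -- hypothetical stand-in; the real script is not shown in the question.
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="test")
    # Distribute a small range across the executors and reduce it back.
    total = sc.parallelize(range(1000)).sum()
    print("sum = %d" % total)
    sc.stop()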
The process produces a log dump, including the following confirmation of deployment to the cluster:
16/08/29 20:47:51 INFO Client: Uploading resource file:/home/hadoop/test.py -> hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/test.py
16/08/29 20:47:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/pyspark.zip
16/08/29 20:47:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.1-src.zip -> hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/py4j-0.10.1-src.zip
Yet the application fails to run, reporting a missing py4j library:
16/08/29 20:48:47 INFO Client: Application report for application_1472396426409_0007 (state: ACCEPTED)
16/08/29 20:48:48 INFO Client: Application report for application_1472396426409_0007 (state: FAILED)
16/08/29 20:48:48 INFO Client:
client token: N/A
diagnostics: Application application_1472396426409_0007 failed 2 times due to AM Container for appattempt_1472396426409_0007_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://ip-xxx-xxx-xxx-xxx.ec2.internal:8088/cluster/app/application_1472396426409_0007Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/py4j-0.10.1-src.zip
java.io.FileNotFoundException: File does not exist: hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/py4j-0.10.1-src.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
Am I misusing the command or something?
Upvotes: 3
Views: 819
Reputation: 50488
This appears to be a bug in the AWS system. YARN monitors the staging directory and notices that the deployed code is no longer there, which is really a sign that Spark is done processing and has cleaned up its staging files.
To verify that this is the problem, double-check by reading the logs for your application; i.e., run something like this against your master node:
yarn logs -applicationId application_1472396426409_0007
and make sure you see a success message in the logs:
INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
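If you'd rather script that check than eyeball the output, here is a minimal sketch (a hypothetical helper, not part of the original answer; it assumes yarn is on the PATH of the master node and simply shells out to the same command):
# check_app.py -- hypothetical helper: fetch the aggregated YARN logs
# and look for the ApplicationMaster's success line.
import subprocess
import sys

APP_ID = "application_1472396426409_0007"

output = subprocess.check_output(
    ["yarn", "logs", "-applicationId", APP_ID]
).decode("utf-8")

if "Unregistering ApplicationMaster with SUCCEEDED" in output:
    print("%s succeeded" % APP_ID)
else:
    sys.exit("%s did not report success; inspect the full logs" % APP_ID)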
Upvotes: 1