blueberryfields

Reputation: 50488

What is the correct method to spark-submit Python applications on AWS EMR?

I'm connected to the master node of a Spark cluster running inside of EMR, and am trying to submit a Python-based application:

spark-submit --verbose --deploy-mode cluster --master yarn-cluster --num-executors 3 --executor-cores 6 --executor-memory 1g test.py 
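(As an aside, --deploy-mode cluster and --master yarn-cluster overlap: the yarn-cluster master is the pre-Spark-2.0 spelling, and on Spark 2.x the same submission would typically be written with --master yarn, for example:

spark-submit --verbose --master yarn --deploy-mode cluster --num-executors 3 --executor-cores 6 --executor-memory 1g test.py

Either form should run the driver inside the cluster.)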

The process produces a set of logs, including the following confirmation that the files were uploaded to the cluster:

16/08/29 20:47:51 INFO Client: Uploading resource file:/home/hadoop/test.py -> hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/test.py
16/08/29 20:47:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/pyspark.zip
16/08/29 20:47:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.1-src.zip -> hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/py4j-0.10.1-src.zip

Yet the application fails to run, reporting a missing py4j library:

16/08/29 20:48:47 INFO Client: Application report for application_1472396426409_0007 (state: ACCEPTED)
16/08/29 20:48:48 INFO Client: Application report for application_1472396426409_0007 (state: FAILED)
16/08/29 20:48:48 INFO Client: 
     client token: N/A
     diagnostics: Application application_1472396426409_0007 failed 2 times due to AM Container for appattempt_1472396426409_0007_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://ip-xxx-xxx-xxx-xxx.ec2.internal:8088/cluster/app/application_1472396426409_0007Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/py4j-0.10.1-src.zip
java.io.FileNotFoundException: File does not exist: hdfs://ip-xxx-xxx-xxx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1472396426409_0007/py4j-0.10.1-src.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)

Am I misusing the command or something?

Upvotes: 3

Views: 819

Answers (1)

blueberryfields

Reputation: 50488

This appears to be a bug in the AWS system: YARN monitors the cluster, notices that the deployed code is no longer there, and reports a failure, when in reality the missing staging files are a sign that Spark has finished processing.

To verify that this is the problem, read the logs for your application, i.e., run something like this on the master node:

yarn logs -applicationId application_1472396426409_0007

and confirm that the logs contain a success message:

INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
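
If you'd rather script this check than eyeball the full log dump, here is a minimal sketch (assuming the same application id as above, and that yarn is on your PATH; the echoed messages are just illustrative):

yarn logs -applicationId application_1472396426409_0007 2>/dev/null \
  | grep -q "Unregistering ApplicationMaster with SUCCEEDED" \
  && echo "application succeeded" \
  || echo "success line not found - inspect the logs"

This simply searches the aggregated YARN logs for the success line and prints a one-line verdict.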

Upvotes: 1
