Reputation: 429
I'm trying to create an EMR cluster with the AWS CLI to run a Python script that uses PySpark, as follows:
aws emr create-cluster --name "emr cluster for pyspark (test)" \
--applications Name=Spark Name=Hadoop --release-label emr-5.25.0 --use-default-roles \
--ec2-attributes KeyName=my-key --instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.xlarge \
--bootstrap-actions Path="s3://mybucket/my_bootstrap.sh" --steps \
Type=CUSTOM_JAR,Name="Spark Count group by QRACE",ActionOnFailure=CONTINUE\
,Jar=s3://us-east-2.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=["s3://mybucket/my_step.py","s3://mybucket/my_input.txt","s3://mybucket/output"] \
--log-uri "s3://mybucket/logs"
The bootstrap script sets up Python 3.7, installs PySpark 2.4.3, and installs Java 8. However, my script fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
I've tried adding a --configurations argument with the following JSON file to the create-cluster command, but it did not help:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
Any pointers as to where I could look or what I could do would be very helpful!
EDIT: I was able to fix this issue by following the suggestions of @Lamanus. However, my PySpark application runs perfectly on EMR 5.30.1 but fails on EMR 5.25.0.
I am now getting the following error:
Exception in thread "main" org.apache.spark.SparkException: Application application_1596402225924_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1148)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1525)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm not sure where to look for a helpful error report or log that shows what went wrong. The same application works perfectly with EMR 5.30.1 and Spark 2.4.5.
Upvotes: 6
Views: 9574
Reputation: 1
I wasn't able to vote on the last answer by @chittychitty, but it is correct: do not install PySpark over the one provided by EMR.
Upvotes: 0
Reputation: 429
Update: this happened because the bootstrap script installed PySpark even though the cluster already came with its own copy.
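In case it helps anyone debugging the same thing, here is a minimal sketch for checking which PySpark the Python interpreter actually picks up on a node. It assumes EMR keeps its bundled Spark under /usr/lib/spark (please verify on your release) and that a pip-installed copy lands in a site-packages or dist-packages directory. The helper names is_pip_installed and find_pyspark are my own, not part of any library.

```python
import importlib.util

# Assumption: EMR's bundled Spark lives here; check this on your own cluster.
EMR_SPARK_HOME = "/usr/lib/spark"


def is_pip_installed(module_path):
    """Return True if the path looks like a pip install (site-packages or dist-packages)."""
    parts = module_path.replace("\\", "/").split("/")
    return "site-packages" in parts or "dist-packages" in parts


def find_pyspark():
    """Return the file that `import pyspark` would load, or None if it is not importable."""
    spec = importlib.util.find_spec("pyspark")
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    origin = find_pyspark()
    if origin is None:
        print("pyspark is not importable from this interpreter")
    elif is_pip_installed(origin):
        # A pip copy can shadow the EMR-provided Spark and miss the EMRFS jars.
        print("pyspark comes from a pip install:", origin)
    else:
        print("pyspark comes from:", origin, "(expected under", EMR_SPARK_HOME + ")")
```

Running this on the master node over SSH with the same interpreter the step uses should show whether a pip-installed copy is the one being loaded. If it is, removing the pip install line from the bootstrap script is the fix described above.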
Upvotes: 6