Add pyspark script as AWS step

Question

I have a PySpark script that reads an XML file stored in S3. I need to add it as a step on an AWS EMR cluster. I used the following command:

aws emr add-steps --cluster-id  --steps Type=spark,Name=POI,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,],ActionOnFailure=CONTINUE
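For readers who prefer to submit the step from Python, the same step can be sketched with boto3's `add_job_flow_steps`. This is only an illustration under stated assumptions: the script location `s3://my-bucket/scripts/read_xml.py` and the helper names `build_spark_step` and `submit_step` are hypothetical and do not come from the question, and the cluster ID still has to be supplied by the caller.

```python
def build_spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Build one step definition in the shape boto3's add_job_flow_steps expects."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                "--conf", "spark.yarn.submit.waitAppCompletion=true",
                script_uri,  # hypothetical script location, not from the question
            ],
        },
    }


def submit_step(cluster_id, step):
    """Send the step to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so building a step works without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


# Hypothetical usage: the S3 path below is a placeholder.
step = build_spark_step("POI", "s3://my-bucket/scripts/read_xml.py")
print(step["HadoopJarStep"]["Args"])
```

Keeping the step definition separate from the submission call makes it easy to inspect the exact arguments before anything is sent to the cluster.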

I downloaded the spark-xml jar to the master node during bootstrap, and it is present under

/home/hadoop

Also, in the Python script I have included

conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")

But it still shows the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
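One possible explanation, offered as an assumption rather than a confirmed diagnosis: with --deploy-mode cluster the driver runs inside a YARN container that may not be on the master node, and settings such as spark.driver.extraClassPath applied inside the script take effect only after the driver JVM has already started. A commonly used alternative is to hand the jar to spark-submit itself with --jars. The sketch below only assembles the `--steps` shorthand string for that variant; the script location and the helper name `step_shorthand` are hypothetical, and the jar path is the one mentioned above.

```python
# Jar path as given in the question.
JAR_PATH = "/home/hadoop/spark-xml_2.11-0.4.1.jar"


def step_shorthand(name, script_uri, jar_path=JAR_PATH):
    """Assemble the value passed to `aws emr add-steps --steps`."""
    args = [
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--conf", "spark.yarn.submit.waitAppCompletion=true",
        "--jars", jar_path,  # ask spark-submit to ship the jar with the job
        script_uri,          # hypothetical script location, not from the question
    ]
    return "Type=Spark,Name={},Args=[{}],ActionOnFailure=CONTINUE".format(
        name, ",".join(args)
    )


# Hypothetical usage: the S3 path below is a placeholder.
shorthand = step_shorthand("POI", "s3://my-bucket/scripts/read_xml.py")
print(shorthand)
```

If the cluster can reach a Maven repository, replacing the `--jars` pair with `--packages` and the coordinate `com.databricks:spark-xml_2.11:0.4.1` is another option worth trying.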
