Rohit Anil

Reputation: 314

Add PySpark script as an AWS EMR step

I have a PySpark script that reads an XML file (present in S3). I need to add this as a step on AWS EMR. I have used the following command:

aws emr add-steps --cluster-id <cluster_id> --steps Type=Spark,Name=POI,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE

I have downloaded the spark-xml jar to the master node during bootstrap, and it is present under

/home/hadoop

Also, in the Python script I have included:

conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
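
For context, here is a minimal sketch of how that configuration and the spark-xml read fit together; the S3 path and the rowTag value are placeholders, not the ones from my actual script:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Point spark.jars and the driver classpath at the local copy of spark-xml
conf = (SparkConf()
        .setAppName('Project')
        .set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
        .set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# The load() call below is where the Py4JJavaError / ClassNotFoundException is raised
# when the com.databricks.spark.xml data source is not on the classpath.
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")              # placeholder row tag
      .load("s3://my-bucket/path/file.xml"))   # placeholder S3 location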

But it still shows:

py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html

Upvotes: 0

Views: 106

Answers (1)

SnigJi

Reputation: 1410

You have set master as yarn and deploy mode as cluster, which means your Spark driver will run on one of the CORE nodes. By default, EMR creates the application master on one of the CORE nodes, and the application master contains the driver.

Please refer to this article for more info.

So you have to put your jar on all CORE nodes (not on the MASTER node) and refer to the file as file:///home/hadoop/spark-xml_2.11-0.4.1.jar.

Or, better, put it in HDFS (let's say under hdfs:///user/hadoop) and refer to it as hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar.
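
As a rough sketch of the HDFS approach (assuming the jar is copied to hdfs:///user/hadoop beforehand, for example from the master node or in a preceding step):

# Copy the jar to HDFS first, e.g. from the master node:
#   hadoop fs -put /home/hadoop/spark-xml_2.11-0.4.1.jar /user/hadoop/
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Refer to the jar by its HDFS URI so the driver and executors on any node can fetch it
conf = (SparkConf()
        .setAppName('Project')
        .set("spark.jars", "hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()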

Upvotes: 1
