Reputation: 1805
When I submit a Python script as the jar to a Spark action in Oozie, I see the error below:
Traceback (most recent call last):
File "/home/hadoop/spark.py", line 5, in <module>
from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark
Intercepting System.exit(1)
The pyspark libraries do exist on my local FS, though:
$ ls /usr/lib/spark/python/pyspark/
accumulators.py heapq3.py rdd.py statcounter.py
broadcast.py __init__.py rddsampler.py status.py
cloudpickle.py java_gateway.py resultiterable.py storagelevel.py
conf.py join.py serializers.py streaming/
context.py ml/ shell.py tests.py
daemon.py mllib/ shuffle.py traceback_utils.py
files.py profiler.py sql/ worker.py
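For what it's worth, the import succeeds locally when the path is supplied by hand. Something along these lines works from a shell (the py4j zip name and location are my assumption, based on the usual /usr/lib/spark/python/lib layout):

$ PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.9-src.zip \
      python -c "from pyspark import SparkContext, SparkConf"

So the modules exist; they just do not seem to be on the Python path when Oozie launches the script.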
I know there have been issues with running pyspark through Oozie, e.g. https://issues.apache.org/jira/browse/OOZIE-2482, but the error I am seeing is different from the one in that JIRA ticket.
I am also passing --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark in the spark-opts of my workflow definition.
Here is my sample application for reference. First, my job.properties:
masterNode=ip-xxx-xx-xx-xx.ec2.internal
nameNode=hdfs://${masterNode}:8020
jobTracker=${masterNode}:8032
master=yarn
mode=client
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.application.path=/user/oozie/apps/
Next, my workflow.xml:
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-action-test"/>
    <action name="spark-action-test">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>Spark Example</name>
            <jar>/home/hadoop/spark.py</jar>
            <spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 4 --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/lib/spark/python --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python --files ${nameNode}/user/oozie/apps/hive-site.xml</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Finally, my PySpark script (/home/hadoop/spark.py):
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf().setAppName('test_pyspark_oozie')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
As recommended at http://www.learn4master.com/big-data/pyspark/run-pyspark-on-oozie, I also put the two files py4j-0.9-src.zip and pyspark.zip under my ${nameNode}/user/oozie/share/lib folder.
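Concretely, the copy was along these lines, assuming the zips live in the standard /usr/lib/spark/python/lib location:

$ hdfs dfs -put /usr/lib/spark/python/lib/pyspark.zip /user/oozie/share/lib/
$ hdfs dfs -put /usr/lib/spark/python/lib/py4j-0.9-src.zip /user/oozie/share/lib/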
I am using a single-node YARN cluster (AWS EMR) and am trying to find out how I can make these pyspark modules available to Python in my Oozie application. Any help is appreciated.
Upvotes: 0
Views: 2963
Reputation: 98
You are getting the No module named error because you have not set PYTHONPATH in your configuration. Add one more line in --conf with PYTHONPATH=/usr/lib/spark/python. I don't know how to set PYTHONPATH in an Oozie workflow definition specifically, but adding the PYTHONPATH property to your configuration will definitely solve your problem.
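For example (a sketch only, not something I have tested; the py4j zip name depends on your Spark version), the <spark-opts> in your workflow could become:

<spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 4
    --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark
    --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark
    --conf spark.yarn.appMasterEnv.PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.9-src.zip
    --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.9-src.zip
    --files ${nameNode}/user/oozie/apps/hive-site.xml</spark-opts>

Note that spark.yarn.appMasterEnv.* sets the environment variable on the YARN application master and spark.executorEnv.* sets it on the executors.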
Upvotes: 2