Reputation: 1029
I am working with a Scala program that uses Spark packages. Currently I run the program with the following bash command from the gateway: /homes/spark/bin/spark-submit --master yarn-cluster --class "com.xxx.yyy.zzz" --driver-java-options "-Dyyy.num=5" a.jar arg1 arg2
I would like to start using Oozie to run this job, but I have a few questions:
Where should I put the spark-submit executable? On HDFS? How do I define the Spark action? Where should the --driver-java-options appear? What should the Oozie action look like? Is it similar to the one appearing here?
Upvotes: 6
Views: 7113
Reputation: 1635
If you have a new enough version of Oozie, you can use Oozie's Spark action:
https://github.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd
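With the Spark action, your command line maps almost directly onto the action XML, and --driver-java-options goes into <spark-opts>; you don't need to put the spark-submit executable on HDFS at all, since Oozie launches Spark from its sharelib on the cluster. A minimal sketch against the 0.1 schema (the action name, the HDFS path to a.jar, and the ok/error transitions are assumptions for illustration; adjust to your setup):
<action name="spark-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${job_tracker}</job-tracker>
        <name-node>${name_node}</name-node>
        <master>yarn-cluster</master>
        <name>zzz</name>
        <class>com.xxx.yyy.zzz</class>
        <!-- your application jar, uploaded to HDFS alongside the workflow -->
        <jar>${name_node}/user/spark/apps/a.jar</jar>
        <!-- driver JVM options go into spark-opts, as they would on the command line -->
        <spark-opts>--driver-java-options -Dyyy.num=5</spark-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>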
Otherwise, you need to execute a Java action that invokes SparkSubmit directly. Something like:
<java>
    <main-class>org.apache.spark.deploy.SparkSubmit</main-class>
    <arg>--class</arg>
    <arg>${spark_main_class}</arg>       <!-- this is the class com.xxx.yyy.zzz -->
    <arg>--deploy-mode</arg>
    <arg>cluster</arg>
    <arg>--master</arg>
    <arg>yarn</arg>
    <arg>--queue</arg>
    <arg>${queue_name}</arg>             <!-- depends on your oozie config -->
    <arg>--num-executors</arg>
    <arg>${spark_num_executors}</arg>
    <arg>--executor-cores</arg>
    <arg>${spark_executor_cores}</arg>
    <arg>${spark_app_file}</arg>         <!-- jar that contains your spark job, written in scala -->
    <arg>${input}</arg>                  <!-- some arg -->
    <arg>${output}</arg>                 <!-- some other arg -->
    <file>${spark_app_file}</file>
    <file>${name_node}/user/spark/share/lib/spark-assembly.jar</file>
</java>
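Any other spark-submit flag, including the --driver-java-options "-Dyyy.num=5" from your original command, can be passed as additional <arg> elements before ${spark_app_file}, exactly as on the command line. The ${...} parameters are resolved from the job.properties file you submit with the workflow; a sketch with hypothetical values (hosts and paths are placeholders for your cluster):
# hypothetical values; match these to your cluster
name_node=hdfs://namenode:8020
spark_main_class=com.xxx.yyy.zzz
queue_name=default
spark_num_executors=10
spark_executor_cores=2
spark_app_file=${name_node}/user/spark/apps/a.jar
input=arg1
output=arg2
oozie.wf.application.path=${name_node}/user/spark/apps/zzz
With that in place you submit the workflow as usual: oozie job -config job.properties -run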
Upvotes: 6