Hilmi

Reputation: 57

Running Spark2 from Oozie (CDH)

I am attempting to run a spark job (using spark2-submit) from Oozie, so this job can be run on a schedule.

The job runs just fine when we run the shell script from the command line under our service account (not yarn). When we run it as an Oozie workflow, the following happens:

17/11/16 12:03:55 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=yarn, access=WRITE, inode="/user":hdfs:supergroup:drwxrwxr-x

Oozie is running the job as the user yarn. IT has denied us any ability to change yarn's permissions in HDFS, and there is not a single reference to the user directory in the Spark script. We have also tried to ssh into the server, but that doesn't work - we have to ssh out of our worker nodes, onto the master.

The shell script:

spark2-submit --name "SparkRunner" --master yarn --deploy-mode client --class org.package-name.Runner  hdfs://manager-node-hdfs/Analytics/Spark_jars/SparkRunner.jar

Any help would be appreciated.

Upvotes: 0

Views: 2951

Answers (3)

Samson Scharfrichter

Reputation: 9067

From Launching Spark (2.1) on YARN...

spark.yarn.stagingDir
Staging directory used while submitting applications
Default: current user's home directory in the filesystem

So, if you can create an HDFS directory somewhere and grant yarn the required privileges -- i.e. rx on all parent dirs and rwx on the dir itself -- and then point Spark at that dir instead of /user/yarn (which does not exist), you should be fine.
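A minimal sketch of that approach, assuming a made-up staging path /Analytics/spark-staging (substitute a directory you are allowed to create) and the same jar as in the question:

```shell
# Create a staging dir yarn can write to (path is an assumption).
hdfs dfs -mkdir -p /Analytics/spark-staging
hdfs dfs -chmod 777 /Analytics/spark-staging   # or a tighter ACL granting yarn rwx

# Tell Spark to stage there instead of the (missing) /user/yarn home dir.
spark2-submit --name "SparkRunner" \
  --master yarn --deploy-mode client \
  --conf spark.yarn.stagingDir=hdfs://manager-node-hdfs/Analytics/spark-staging \
  --class org.package-name.Runner \
  hdfs://manager-node-hdfs/Analytics/Spark_jars/SparkRunner.jar
```

The chmod 777 is just the bluntest way to grant yarn write access; an ACL scoped to the yarn user would be cleaner if your IT policy allows it.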

Upvotes: 0

Hilmi

Reputation: 57

I was able to fix this by following https://stackoverflow.com/a/32834087/8099994

At the beginning of my shell script I now include the following line:

export HADOOP_USER_NAME=serviceAccount;
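In context, the submit script then looks something like this ("serviceAccount" is a stand-in for your actual HDFS user):

```shell
#!/bin/bash
# Impersonate the service account for all Hadoop client calls in this script.
# Note: this only works on clusters without Kerberos; in a secure cluster the
# Kerberos principal determines the user and HADOOP_USER_NAME is ignored.
export HADOOP_USER_NAME=serviceAccount

spark2-submit --name "SparkRunner" \
  --master yarn --deploy-mode client \
  --class org.package-name.Runner \
  hdfs://manager-node-hdfs/Analytics/Spark_jars/SparkRunner.jar
```

With the variable set, Spark's staging directory resolves to /user/serviceAccount instead of /user/yarn, which is what the AccessControlException was complaining about.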

Upvotes: 0

Amit Kumar

Reputation: 1584

You need to add "<env-var>HADOOP_USER_NAME=${wf:user()}</env-var>" to the shell action of your Oozie workflow.xml, so that Oozie uses the home directory of the user who triggered the workflow rather than the yarn home directory.

e.g.

<action name='shellaction'>
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>sparksubmitShellScript.sh</exec>
            <argument>${providearg}</argument>
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
            <file>${appPath}/sparksubmitShellScript.sh#sparksubmitShellScript.sh
            </file>
        </shell>
    </action>

Modify this as per your workflow. If required, you can also specify the user name directly rather than using the user who triggered the workflow, as below:

<env-var>HADOOP_USER_NAME=${userName}</env-var>

and specify userName=usernamevalue in your job.properties.
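A sketch of the corresponding job.properties, where serviceAccount and the nameNode/jobTracker values are placeholders you would replace with your cluster's own:

```
# job.properties (values are assumptions for illustration)
nameNode=hdfs://manager-node-hdfs
jobTracker=resourcemanager-host:8032
userName=serviceAccount
appPath=${nameNode}/Analytics/oozie/sparkrunner
```

Oozie substitutes ${userName} into the <env-var> element at submission time, so the shell action exports HADOOP_USER_NAME=serviceAccount before the script runs.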

Upvotes: 1
