Reputation: 883
I am trying to run a simple Java Spark job using Oozie on an EMR cluster. The job just takes files from an input path, performs a few basic operations on them, and writes the result to a different output path.
When I run it from the command line using spark-submit, as shown below, it works fine:
spark-submit --class com.someClassName --master yarn --deploy-mode cluster /home/hadoop/some-local-path/my-jar-file.jar yarn s3n://input-path s3n://output-path
Then I set up the same thing in an Oozie workflow. However, when run from there, the job always fails. The stdout log contains the following:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Attempt to add (hdfs://[emr-cluster]:8020/user/oozie/workflows/[WF-Name]/lib/[my-jar-file].jar) multiple times to the distributed cache.
java.lang.IllegalArgumentException: Attempt to add (hdfs://[emr-cluster]:8020/user/oozie/workflows/[WF-Name]/lib/[my-jar-file].jar) multiple times to the distributed cache.
I found a KB note and another question here on Stack Overflow that deal with a similar error. But in those cases, the job was failing because of an internal JAR file - not the one the user passes in to run. Nonetheless, I tried their resolution steps of removing the JAR files shared between the Spark and Oozie sharelibs, and ended up deleting a few files from "/user/oozie/share/lib/lib_*/spark". Unfortunately, that did not solve the problem either.
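For reference, the sharelib contents can be inspected with something like the commands below (the Oozie host and the timestamped lib_* directory name are placeholders for the values on our cluster):
# List the jars currently in the Spark sharelib (directory name varies per cluster)
hdfs dfs -ls /user/oozie/share/lib/lib_*/spark/
# Ask the Oozie server which sharelib jars it will actually pick up
oozie admin -oozie http://[oozie-host]:11000/oozie -shareliblist spark
# After changing jars on HDFS, refresh the sharelib so Oozie picks up the change
oozie admin -oozie http://[oozie-host]:11000/oozie -sharelibupdate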
Any ideas on how to debug this issue?
Upvotes: 2
Views: 4533
Reputation: 883
So we finally figured out the issue - at least in our case.
While creating the workflow in Hue, adding a Spark action prompts by default for "File" and "Jar/py name". We provided the path to the JAR file we wanted to run and the name of that JAR file in those two fields respectively, and Hue generated the basic action. The final XML that it created was as follows:
<action name="spark-210e">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>CleanseData</name>
<class>com.data.CleanseData</class>
<jar>JCleanseData.jar</jar>
<spark-opts>--driver-memory 2G --executor-memory 2G --num-executors 10 --files hive-site.xml</spark-opts>
<arg>yarn</arg>
<arg>[someArg1]</arg>
<arg>[someArg2]</arg>
<file>lib/JCleanseData.jar#JCleanseData.jar</file>
</spark>
<ok to="[nextAction]"/>
<error to="Kill"/>
</action>
The default <file> tag in it was causing the issue in our case. So we removed it and edited the definition as shown below, and that worked. Note the change to the <jar> tag as well.
<action name="spark-210e">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>CleanseData</name>
<class>com.data.CleanseData</class>
<jar>hdfs://path/to/JCleanseData.jar</jar>
<spark-opts>--driver-memory 2G --executor-memory 2G --num-executors 10 --files hive-site.xml</spark-opts>
<arg>yarn</arg>
<arg>[someArg1]</arg>
<arg>[someArg1]</arg>
</spark>
<ok to="[nextAction]"/>
<error to="Kill"/>
</action>
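For completeness, the ${jobTracker} and ${nameNode} variables above come from the job.properties the workflow is submitted with - something along these lines (all values here are placeholders, not our actual settings):
nameNode=hdfs://[emr-cluster]:8020
jobTracker=[emr-cluster]:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/oozie/workflows/[WF-Name]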
PS: We had a similar issue with Hive actions too. The hive-site.xml file we were supposed to pass with the Hive action - which created a <job-xml> tag - was also causing issues. So we removed it and it worked as expected.
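For what it's worth, the failing Hive action looked roughly like this (names and paths are illustrative, not our exact definition); the <job-xml> line is the one we removed:
<action name="hive-xyz1">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- this job-xml entry, generated by Hue for hive-site.xml, is what we removed -->
        <job-xml>hive-site.xml</job-xml>
        <script>[some-script].hql</script>
    </hive>
    <ok to="[nextAction]"/>
    <error to="Kill"/>
</action>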
Upvotes: 1