jdevelop

Reputation: 12296

Hadoop - submit a job with lots of dependencies (jar files)

I want to write some sort of "bootstrap" class that will watch an MQ for incoming messages and submit map/reduce jobs to Hadoop. These jobs rely heavily on some external libraries. For the moment I have an implementation of these jobs packaged as a ZIP file with bin, lib and log folders (I'm using maven-assembly-plugin to tie things together).

Now I want to provide small wrappers for the Mapper and Reducer that will use parts of the existing application.

As far as I've learned, when a job is submitted, Hadoop tries to find the JAR file that contains the mapper/reducer classes and copies this jar over the network to the data nodes that will process the data. But it's not clear how I tell Hadoop to copy all the dependencies as well.
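To illustrate the setup, my driver would look roughly like the sketch below (class names are just placeholders); as far as I can tell, setJarByClass only points Hadoop at the single jar containing the driver class, not at the external libraries the wrappers need:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BootstrapDriver {

        // Placeholder wrappers; the real ones would delegate to the existing
        // application (and therefore need its libraries on the task classpath).
        public static class WrapperMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> { }

        public static class WrapperReducer
                extends Reducer<LongWritable, Text, LongWritable, Text> { }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "mq-triggered-job");

            // Hadoop ships only the jar that contains this class to the cluster;
            // the libraries the wrappers depend on are not copied automatically.
            job.setJarByClass(BootstrapDriver.class);

            job.setMapperClass(WrapperMapper.class);
            job.setReducerClass(WrapperReducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }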

I could use maven-shade-plugin to create an uber-jar with the job and its dependencies, and another jar for the bootstrap (which would be executed with the hadoop shell script).

Please advise.

Upvotes: 1

Views: 5907

Answers (3)

Vladimir Kroz

Reputation: 5367

Use the -libjars option of the hadoop launcher script to specify dependencies for the job running on the remote JVMs; use the $HADOOP_CLASSPATH variable to set dependencies for the JobClient running on the local JVM.

Detailed discussion is here: http://grepalex.com/2013/02/25/hadoop-libjars/
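Note that -libjars only takes effect if the driver runs its arguments through GenericOptionsParser, which ToolRunner does for you. A minimal sketch (jar names and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Launched along these lines (jar names are placeholders):
    //   export HADOOP_CLASSPATH=lib/dep1.jar:lib/dep2.jar         # local JobClient classpath
    //   hadoop jar bootstrap.jar LibJarsDriver \
    //       -libjars lib/dep1.jar,lib/dep2.jar <input> <output>   # shipped to remote JVMs
    public class LibJarsDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already carries the -libjars setting, because ToolRunner
            // ran the arguments through GenericOptionsParser; the listed jars are
            // copied to the cluster and added to the task classpath.
            Job job = new Job(getConf(), "job-with-libjars");
            job.setJarByClass(LibJarsDriver.class);
            // ... set mapper, reducer and key/value classes as usual ...
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new LibJarsDriver(), args));
        }
    }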

Upvotes: 0

Chris Gerken

Reputation: 16392

Use Maven to manage the dependencies and ensure the correct versions are used during builds and deployment. Popular IDEs have Maven support, so you don't have to worry about building class paths for editing and building. Finally, you can instruct Maven to build a single jar (a "jar-with-dependencies") containing your app and all its dependencies, which makes deployment very easy.

As for dependencies like hadoop itself, which are guaranteed to be on the runtime class path, you can declare them with a scope of "provided" so they're not included in the uber-jar.

Upvotes: 0

Tariq

Reputation: 34184

One way is to put the required jars in the distributed cache. An alternative is to install all the required jars on the Hadoop nodes themselves and tell the TaskTrackers about their location. I would suggest you go through this post once; it talks about the same issue.
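For the distributed cache option, a rough sketch along these lines should work with the old (pre-YARN) API; the HDFS directory is a made-up example, and the jars are assumed to have been uploaded there beforehand:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CachedJarsDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical HDFS directory holding the dependency jars,
            // populated once with e.g.: hadoop fs -put lib/*.jar /apps/myjob/lib
            Path libDir = new Path("/apps/myjob/lib");
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus jar : fs.listStatus(libDir)) {
                // Each jar is distributed to the nodes running the tasks
                // and placed on the task classpath there.
                DistributedCache.addFileToClassPath(jar.getPath(), conf);
            }

            Job job = new Job(conf, "job-with-cached-jars");
            job.setJarByClass(CachedJarsDriver.class);
            // ... set mapper, reducer and key/value classes as usual ...
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }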

Upvotes: 2
