CRS

Reputation: 471

Hadoop: External jar folder reference

I have written a simple MapReduce job which invokes a few methods from external jars. I added these jars to the hadoop/lib folder and they are picked up. Everything works fine on a single-node cluster. I now want to run the same code on a multi-node cluster. Is there a way to copy my jars to DFS so that I do not have to add them manually on every node? I would also like to keep all the jars in a separate folder (not hadoop/lib); is it possible to add an external reference to a folder containing many jars? I followed the Cloudera blog post on this, but it did not help. Any pointers would be really helpful. I am using Hadoop 1.0.4.

P.S.: I have also tried adding all the external jars inside the main job jar, but even then they are not picked up.

Upvotes: 1

Views: 1825

Answers (1)

Chris White

Reputation: 30089

There are two mechanisms for including other jars in your job's classpath:

If you haven't already stored them in HDFS, you can use the GenericOptionsParser's -libjars argument. This will cause the JobClient to upload the jars to a temporary directory in HDFS and include them in the distributed cache for your job. For this to work, you'll need to run your job via the ToolRunner.run interface:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    // configure your job
    // ..

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}

Then you'd run your job as follows (adding jars 1-3 to the job classpath):

#> hadoop jar myjob.jar MyJob -libjars jar1.jar,jar2.jar,jar3.jar [other args]

If your jars are already in HDFS, then you just need to add them to the distributed cache:

public int run(String[] args) throws Exception {
  Job job = new Job(getConf());
  // configure your job
  // ..

  // acquire the job configuration
  Configuration conf = job.getConfiguration();

  // create a FileSystem for the cluster's default file system
  FileSystem fs = FileSystem.get(conf);

  // add each jar (already in HDFS) to the job's classpath via the distributed cache
  DistributedCache.addFileToClassPath(new Path("/myapp/jar1.jar"), conf, fs);
  DistributedCache.addFileToClassPath(new Path("/myapp/jar2.jar"), conf, fs);
  DistributedCache.addFileToClassPath(new Path("/myapp/jar3.jar"), conf, fs);

  return job.waitForCompletion(true) ? 0 : 1;
}
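
If the jars are not in HDFS yet, a minimal sketch of uploading them beforehand with the standard hadoop fs commands (the /myapp path is just an example, matching the snippet above):

#> hadoop fs -mkdir /myapp
#> hadoop fs -put jar1.jar jar2.jar jar3.jar /myapp/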

The only downside of this second method is that you cannot reference any class in these jars in your job configuration code (unless you also have copies client-side, and you configure the HADOOP_CLASSPATH environment variable).
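
For reference, a minimal sketch of that client-side workaround, assuming the jars are also available in a local lib/ directory (the paths are just examples):

#> export HADOOP_CLASSPATH=lib/jar1.jar:lib/jar2.jar:lib/jar3.jar
#> hadoop jar myjob.jar MyJob [other args]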

Upvotes: 2
