alexandru.asandei89

Reputation: 98

Hadoop - set custom jdk path/version in job configuration

I have a mapreduce jar file that requires JDK 1.8 and a Hadoop cluster that has JDK 1.7 installed and configured.

Is it possible to run my jar without changing the Hadoop configuration (i.e. no change to hadoop-env.sh)?

All of the Hadoop nodes also have access to JDK 1.8 and I can easily change JAVA_HOME to point to JDK 1.8 but that does not seem to have any effect without changes to the Hadoop environment variables.

I have already looked at submitting a Hadoop job with ProcessBuilder and at running a MapReduce job from a simple Java program, but it is not clear from those how to deal with the job configuration that you normally have. For instance, I am using this to run my Hadoop job:

hadoop jar MyJar.jar -libjars somelibrary.jar input_folder output_folder

and my main class (already configured as the entry point in the jar's manifest) performs the job configuration like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobMapReduce extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJobMapReduce(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();

        Job job = Job.getInstance(conf, "myjob");
        String inputPath = args[0];
        String outputPath = args[1];
        String inputType = args[2];
        boolean readFolder = Boolean.valueOf(args[3]);
        boolean compressOutput = Boolean.valueOf(args[4]);

        job.setNumReduceTasks(50);
        // input
        if (readFolder)
            FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(inputPath));
        job.setInputFormatClass(TextInputFormat.class);

        // output
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        if (compressOutput) {
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        }

        // entry point
        job.setJarByClass(MyJobMapReduce.class);

        // mapper
        job.setMapperClass(BuildSyntacticTreeMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        // reducer
        job.setReducerClass(DataDumpReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

}

P.S. my job requires JDK 1.8 because one of the libraries that I am using with -libjars requires it.

Upvotes: 2

Views: 2150

Answers (1)

alexandru.asandei89

Reputation: 98

I found out that this doesn't actually require any change to my program or any custom Java launcher. What helped most was How to run a jar file in hadoop? and working out what the `hadoop jar` part of my command

hadoop jar MyJar.jar -libjars somelibrary.jar input_folder output_folder

actually did, which was essentially just setting up the classpath. Therefore, to run a jar in Hadoop with a different Java version than the one configured in hadoop-env.sh, first print the cluster classpath:

hadoop classpath

Its output is then combined with the custom Java location, resulting in

/usr/java/jdk1.8.0_45/bin/java -cp {output from hadoop classpath command}:/path/to/MyJar.jar com.my.SomeClass -libjars somelibrary.jar input_folder output_folder

Upvotes: 3
