Hind Forsum

Reputation: 10497

Hadoop submits a job with a class name, so why is job.setJarByClass() necessary?

E.g. I've got a Hadoop word-count program (from the internet), WordCount.java:

// A top-level class cannot be declared static, so the modifier is dropped here.
public class WordCount {
    public static void main(String[] args) throws Exception {
    ....
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class); // Why?
    }
}

Compile it into a jar and submit it to YARN like this:

hadoop jar wordcount.jar WordCount [input-hdfs] [output-hdfs]

In this command, we have specified:

(1) jar name (2) class name

Given that

  1. Hadoop already knows from its command line that "WordCount" is the class name from wordcount.jar.

  2. The public class of WordCount.java is always WordCount; this is the Java standard, right?

Then what's the point of calling

setJarByClass(WordCount.class)

Seems to me it's redundant. Why is this statement required? Thanks

Upvotes: 0

Views: 361

Answers (1)

OneCricketeer

Reputation: 191701

A single JAR file can contain more than one class with a main method, so the class name on the command line is necessary unless the JAR's manifest declares a Main-Class.
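
As a rough illustration of the manifest alternative, here is a minimal sketch using only the JDK's java.util.jar API. The class ManifestMainClass and its two helper methods are hypothetical names made up for this example; they are not part of Hadoop. It only shows how a Main-Class entry is written to and read from a manifest.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper for illustration only; not part of Hadoop.
final class ManifestMainClass {

    // Builds an in-memory manifest that declares the given entry-point class.
    static Manifest withMainClass(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        return manifest;
    }

    // Returns the declared Main-Class, or null when the manifest has none.
    static String mainClassOf(Manifest manifest) {
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }

    public static void main(String[] args) {
        Manifest manifest = withMainClass("WordCount");
        System.out.println("Main-Class: " + mainClassOf(manifest));
    }
}
```

When the JAR carries such an entry, the class name can be left off the command line, i.e. hadoop jar wordcount.jar [input-hdfs] [output-hdfs].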

The class you pass to job.setJarByClass() doesn't need to be the class with the main method. Hadoop uses that class to find the JAR that contains it, so the JAR can be shipped to the cluster. It can't otherwise automatically know which JAR holds the classes for your job, so you need to set the class in the code as well.
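
To give a feel for how a class can lead back to the file it was loaded from, here is a minimal sketch that uses only the JDK. JarLocator and findClassResource are hypothetical names for this example, and this is not Hadoop's own code, just the general idea of asking the class loader where a class's .class file lives.

```java
import java.net.URL;

// Hypothetical helper for illustration only; not Hadoop's implementation.
final class JarLocator {

    // Returns the URL of the .class file backing the given class, or null if
    // the class loader cannot locate it. For a class packaged in a JAR the URL
    // typically looks like jar:file:/path/to/some.jar!/pkg/Name.class.
    static URL findClassResource(Class<?> clazz) {
        String resource = clazz.getName().replace('.', '/') + ".class";
        ClassLoader loader = clazz.getClassLoader();
        // Classes from the bootstrap loader report a null class loader.
        return loader != null
                ? loader.getResource(resource)
                : ClassLoader.getSystemResource(resource);
    }

    public static void main(String[] args) {
        System.out.println(findClassResource(JarLocator.class));
        System.out.println(findClassResource(String.class));
    }
}
```

Passing a different class, for example a mapper that lives in another JAR, would point at that other JAR instead, which is why the argument doesn't have to be the class with the main method.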

If you did want to get the class from the CLI, though, you could do something like Class.forName(args[2]).
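
A minimal sketch of that idea, using only the JDK so it runs without Hadoop on the classpath. JobClassResolver and resolve are hypothetical names, and the index 2 assumes the class name is passed as a third argument after the input and output paths. In a real driver the result would be handed to job.setJarByClass().

```java
// Hypothetical helper for illustration only; not part of Hadoop.
final class JobClassResolver {

    // Resolves the class named at args[index]; falls back to the given class
    // when that argument is absent.
    static Class<?> resolve(String[] args, int index, Class<?> fallback) {
        if (args.length <= index) {
            return fallback;
        }
        try {
            return Class.forName(args[index]);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown class: " + args[index], e);
        }
    }

    public static void main(String[] args) {
        Class<?> jobClass = resolve(args, 2, JobClassResolver.class);
        // In a Hadoop driver this would be: job.setJarByClass(jobClass);
        System.out.println("Resolved job class: " + jobClass.getName());
    }
}
```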

Upvotes: 1
