stochasticcrap

Reputation: 358

How to build Hadoop Job using Maven

I am new to both Maven and Hadoop, and would like to know how to set up a Maven environment so I can build a simple Hadoop wordcount job. If the wordcount job consists of map.java, reduce.java, and the driver class wordcount.java, where should they be saved so that Maven can compile them into a .jar? I also have a pom.xml. I would greatly appreciate a detailed explanation of how to run the wordcount job using Maven. I am currently doing everything with the single-node-cluster Hadoop tarball in an Ubuntu terminal. I found the links below, which have given me some insight, but I don't fully understand the whole path/directory scheme. Specifically, can the names for the group id and artifact id be arbitrary, or do they relate to some path? What's the deal with the main and src directories? And, more generally, how do I build the Hadoop jar without an IDE?

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-develop-deploy-java-mapreduce/

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Creating_Wordcount_Maven_Project_Eclipse_MapReduce.php

Upvotes: 0

Views: 976

Answers (1)

Ramzy

Reputation: 7138

To run a MapReduce job, all you need is a jar containing the job (driver), mapper, and reducer classes. The real question is how to manage the dependent jars.

Maven is one way of doing it. In the pom you declare the jars your code needs as dependencies. If Maven is set up correctly on your system, then once you have a project with a pom and its dependencies defined, those jars will be resolved automatically. Run mvn clean install, and with the build plugin (maven-jar-plugin) active in your pom, you should get a jar in the target folder.
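A minimal pom might look like the sketch below. The groupId and artifactId are names you choose (groupId is conventionally a reverse domain name); together with the version, they determine where the built artifact lands in a repository, not where your source files go. The values here, including the Hadoop version, are placeholders, so match the version to your cluster. Because the packaging is jar, maven-jar-plugin is bound to the build by default and needs no explicit configuration for a plain jar.

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>

      <!-- Coordinates you choose; com.example/wordcount are placeholders -->
      <groupId>com.example</groupId>
      <artifactId>wordcount</artifactId>
      <version>1.0</version>
      <packaging>jar</packaging>

      <dependencies>
        <!-- Hadoop client APIs; "provided" because the cluster supplies
             Hadoop's own jars at runtime -->
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>2.7.3</version>
          <scope>provided</scope>
        </dependency>
      </dependencies>
    </project>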

So now your jar is built properly. Next, when you take it to the cluster, the dependencies are needed again. One way is to build a fat jar, which bundles the dependencies into your jar so you need not worry about what is available in the cluster environment. The other way is to keep the jar that contains only your classes, and set the Hadoop classpath to point to all the dependency jars on the cluster.
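If you go the fat-jar route, one common choice is the maven-shade-plugin; a sketch to add inside the pom (the plugin version is picked for illustration):

    <build>
      <plugins>
        <!-- Repackages compile/runtime dependencies into one self-contained jar;
             "provided" dependencies such as hadoop-client stay out, which is
             what you want on a cluster -->
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>2.4.3</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>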

Finally, with the above set up, you are good to go using the hadoop jar command.
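For example, from the project folder (jar name, driver class, and HDFS paths are placeholders for your own):

    mvn clean install
    hadoop jar target/wordcount-1.0.jar com.example.WordCount /user/me/input /user/me/output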

Answers to your questions

There are two main folders alongside the pom: src and target. src holds your sources, by convention under src/main/java, in sub-folders matching each class's package. target is where the output of the build is stored (a jar or a war); it is created by the build itself, or by Eclipse during development.
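Concretely, the standard Maven layout for the wordcount project would be the following; the com/example folders are illustrative and must match whatever package your classes declare:

    wordcount/
        pom.xml
        src/main/java/com/example/
            wordcount.java    (driver)
            map.java
            reduce.java
        target/               (created by the build)
            wordcount-1.0.jar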

How to check if Maven is installed: once it is installed and a local repository path is set, run mvn install. This fetches the jars declared in the pom and stores them in the local repository. If that happens, you are good. The usual challenge is firewall issues when downloading jars from external repositories on the internet.
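A quick sanity check from the terminal:

    mvn -version          # prints Maven, Java and OS details if Maven is on the PATH
    mvn clean install     # run inside the project folder; by default, dependencies
                          # are downloaded to ~/.m2/repository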

And do jobs use the same pom? A MapReduce job is defined by a Java class, so all the jobs packaged in that jar use the same pom. You can continue reading on building, jar referencing, Maven usage, and how Maven compares with Ant (the traditional way of building) to improve your knowledge.
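For reference, here is a minimal self-contained WordCount of the kind being described, written against the Hadoop 2.x mapreduce API; the package name matches the placeholder pom above:

    package com.example;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line
        public static class Map extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts for each word
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: wires the job together; run with input and output paths as arguments
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            job.setJarByClass(WordCount.class);   // lets Hadoop ship this jar to the cluster
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setCombinerClass(Reduce.class);   // safe here because summing is associative
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }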

Upvotes: 1
