Reputation: 23
I have one master node and two data nodes, which are on different servers. Each of the two data nodes has a log file in its own HDFS. Now I want to run a Hadoop map/reduce job on the master node, and the input should be the two log files from the two data nodes' HDFS. Can I do this? If I can, how can I set the input path? (e.g. hadoop jar wordcount.jar datanode1/input/logfile1 datanode2/input/logfile2 output ... like this?) Is it possible for the input to come from different data nodes' HDFS on different servers?
Upvotes: 0
Views: 377
Reputation: 34184
When you say Hadoop, there is no such thing as a node's own HDFS. HDFS is a distributed file system spread across all the machines in a Hadoop cluster, functioning as a single FS.
You just have to put both files inside one HDFS directory and give this directory as input to your MapReduce job.
FileInputFormat.addInputPath(job, new Path("/path/to/the/input/directory"));
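For illustration, here is a minimal driver sketch, assuming hypothetical paths /input and /output and keeping the default identity Mapper/Reducer for brevity (you would plug in your own classes with setMapperClass/setReducerClass); it shows the input wired up either as a single directory or as explicitly listed files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "log processing");
        job.setJarByClass(LogJobDriver.class);

        // Point the job at the one HDFS directory that holds both log files.
        // The path below is a placeholder for your own input directory.
        FileInputFormat.addInputPath(job, new Path("/input"));

        // Alternatively, addInputPath can be called once per file if you want
        // to list the two log files explicitly instead of using a directory:
        // FileInputFormat.addInputPath(job, new Path("/input/logfile1"));
        // FileInputFormat.addInputPath(job, new Path("/input/logfile2"));

        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would submit it the same way as in your example (hadoop jar wordcount.jar ...), with the paths pointing into the single cluster-wide HDFS namespace, not at a particular datanode.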
The same holds true for MapReduce jobs. Although you submit your job to the JobTracker, the job actually runs in a distributed fashion on all the nodes of your cluster where the data to be processed is present.
Oh, one more thing... A file in HDFS is not stored as a whole on any particular machine. It gets chopped into blocks of 64 MB (configurable), and these blocks are distributed across the machines in your cluster.
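If you want to see that block layout yourself, here is a small sketch using the standard FileSystem API; the path /input/logfile1 is just a placeholder for one of your files:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; point this at one of your log files.
        Path file = new Path("/input/logfile1");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation covers one block of the file and lists the
        // datanodes holding a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}

For a file larger than the block size you will see several entries, each potentially hosted on different datanodes.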
Upvotes: 1