Reputation: 23
I have one master node and two data nodes, which are on different servers. Each of the two data nodes has a log file in its own HDFS. Now I want to run a Hadoop map/reduce job on the master node, and the input should be the two log files from the two data nodes' HDFS. Can I do this? If I can, how can I set the input path? (e.g. hadoop jar wordcount.jar datanode1/input/logfile1 datanode2/input/logfile2 output ... like this?) Is it possible for the input to come from different data nodes' HDFS on different servers?
Upvotes: 0
Views: 377
Reputation: 34184
When you say Hadoop, there is no such thing as a node's own HDFS. HDFS is a distributed file system spread across all the machines in a Hadoop cluster, functioning as a single FS.
You just have to put both files inside one HDFS directory and give this directory as input to your MapReduce job.
FileInputFormat.addInputPath(job, new Path("/path/to/the/input/directory"));
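For illustration, here is a minimal driver sketch, assuming hypothetical paths /input and /output and keeping the default identity Mapper/Reducer for brevity (you would plug in your own classes with setMapperClass/setReducerClass); it shows the input wired up either as a single directory or as explicitly listed files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "log processing");
        job.setJarByClass(LogJobDriver.class);

        // Point the job at the one HDFS directory that holds both log files.
        // The path below is a placeholder for your own input directory.
        FileInputFormat.addInputPath(job, new Path("/input"));

        // Alternatively, addInputPath can be called once per file if you want
        // to list the two log files explicitly instead of using a directory:
        // FileInputFormat.addInputPath(job, new Path("/input/logfile1"));
        // FileInputFormat.addInputPath(job, new Path("/input/logfile2"));

        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would submit it the same way as in your example (hadoop jar wordcount.jar ...), with the paths pointing into the single cluster-wide HDFS namespace, not at a particular datanode.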
The same holds true for MapReduce jobs. Although you submit your job to the JobTracker, the job actually runs in a distributed fashion on all the nodes of your cluster where the data to be processed is present.
Oh, one more thing... A file in HDFS is not stored as a whole on any particular machine. It gets chopped into blocks of 64 MB (configurable), and these blocks are distributed across the machines in your cluster.
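If you want to see that block layout yourself, here is a small sketch using the standard FileSystem API; the path /input/logfile1 is just a placeholder for one of your files:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; point this at one of your log files.
        Path file = new Path("/input/logfile1");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation covers one block of the file and lists the
        // datanodes holding a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}

For a file larger than the block size you will see several entries, each potentially hosted on different datanodes.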
Upvotes: 1