user2552010

Reputation: 23

Hadoop input from different servers

I have one master node and two data nodes, which are on different servers. Each of the two data nodes has a log file in its HDFS. Now I want to run a Hadoop map/reduce job on the master node, and the input should be the two log files from the two data nodes' HDFS. Can I do this? If so, how do I set the input path? (e.g. hadoop jar wordcount.jar datanode1/input/logfile1 datanode2/input/logfile2 output ...like this?) Is it possible for the input to come from different data nodes' HDFS on different servers?

Upvotes: 0

Views: 377

Answers (1)

Tariq

Reputation: 34184

In Hadoop there is no such thing as a data node's own HDFS. HDFS is a distributed file system that spans all the machines in a Hadoop cluster and functions as a single file system.

You just have to put both files inside one HDFS directory and give this directory as input to your MapReduce job.

FileInputFormat.addInputPath(job, new Path("/path/to/the/input/directory"));
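If the two log files cannot easily be placed under a single directory, addInputPath can simply be called once per path. Below is a minimal driver sketch, not your exact job: the paths /input/logs, /input/logfile1, /input/logfile2 and /output are placeholders, and it assumes a Hadoop version that ships TokenCounterMapper and IntSumReducer in the mapreduce.lib packages, so a plain word count needs no custom mapper or reducer classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance(conf, "word count") on newer Hadoop
    job.setJarByClass(WordCountDriver.class);

    // Built-in tokenizing mapper and summing reducer, enough for a word count.
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Option 1: one HDFS directory that holds both log files.
    FileInputFormat.addInputPath(job, new Path("/input/logs"));
    // Option 2: addInputPath can be called once per file instead:
    // FileInputFormat.addInputPath(job, new Path("/input/logfile1"));
    // FileInputFormat.addInputPath(job, new Path("/input/logfile2"));

    FileOutputFormat.setOutputPath(job, new Path("/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would then run it like any other job, e.g. hadoop jar wordcount.jar WordCountDriver, and read the results from the output directory.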

The same holds true for MapReduce jobs. Although you submit your job to the JobTracker, the job actually runs in a distributed fashion on all the nodes of your cluster where the data to be processed is present.

Oh, one more thing... A file in HDFS is not stored as a whole on any particular machine. It gets chopped into blocks of 64 MB (configurable), and these blocks are spread across different machines in your cluster.
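If you want to see that for yourself, the FileSystem API can report which hosts store each block of a file. A small sketch (the file path is a placeholder):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder path; point it at one of your log files in HDFS.
    FileStatus status = fs.getFileStatus(new Path("/input/logs/logfile1"));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // Each block reports its offset, length, and the datanodes hosting it.
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + Arrays.toString(block.getHosts()));
    }
  }
}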

Upvotes: 1
