Chen Guo

Reputation: 23

How is a Hadoop job run on various nodes?

I am new to Hadoop, so I may ask dumb questions.

I have three Hadoop slave nodes, all of which hold weather data.

Say I have a MapReduce job to find the highest temperature from 1900 to 1989.

My question is:

When we submit the MapReduce job, will Hadoop automatically run it on those three nodes, or do we need to write a script to do so?

Thanks for your patience and your answers.

Upvotes: 0

Views: 480

Answers (2)

OneCricketeer

Reputation: 191701

Data is not sharded by date ranges or keys when it is inserted into HDFS; it is distributed evenly across all nodes (and replicated) in units of HDFS blocks. Even if replication were set to 1, parts of the first date range could exist on all three nodes while the last range might sit on only one node.

The HDFS protocols decide where to place blocks, not your external application acting as a client.

MapReduce (or Spark) will process "input splits" wherever they exist (which could easily be just one node), and this is automatic. Your code is deployed to each node (assuming YARN is used) to read the data on the DataNode itself (if that is how the NodeManagers are installed), and results are then collected back to a driver process, which can optionally write output to your local terminal.
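As a rough illustration of the code that gets shipped to each node, here is a minimal sketch of a map and reduce pair for the max-temperature problem. It assumes a hypothetical input format of one "year,temperature" record per line; the class and field names are invented for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperature {

        // Runs on whichever nodes hold the input splits; emits (year, temperature).
        public static class TemperatureMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical record format: "year,temperature", e.g. "1901,-3"
                String[] parts = line.toString().split(",");
                String year = parts[0].trim();
                int temperature = Integer.parseInt(parts[1].trim());
                context.write(new Text(year), new IntWritable(temperature));
            }
        }

        // Receives all temperatures for a given year and keeps the maximum.
        public static class TemperatureReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable t : temps) {
                    max = Math.max(max, t.get());
                }
                context.write(year, new IntWritable(max));
            }
        }
    }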

Upvotes: 0

Jagrut Sharma

Reputation: 4754

HDFS is a distributed file system, so the weather data will automatically be distributed among the 3 slave nodes. By default, it will be replicated 3 times. Node 1, Node 2, and Node 3 may all hold pieces of data from all 3 time frames (1900-1929, 1930-1959, 1960-1989). This distribution and replication happen automatically when the data is uploaded to HDFS. There is a master node called the NameNode that keeps the metadata describing which file blocks reside on which nodes.
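If you want to see that mapping yourself, here is a small sketch against the HDFS FileSystem API (the file path is hypothetical; point it at wherever your data lives) that asks the NameNode which datanodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
        public static void main(String[] args) throws Exception {
            // Hypothetical path; replace with the actual location of the weather data.
            Path file = new Path("/data/weather/1900-1929.txt");

            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(file);

            // The NameNode's metadata tells us which hosts store each block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }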

MapReduce is a distributed data processing framework. A MapReduce job submitted to the cluster will automatically be distributed across the 3 nodes. Map and reduce tasks will run on the nodes, leveraging data locality as much as possible: each node will try to process the data stored on it whenever it can. If there are task failures, the tasks will be retried up to a certain number of times. All of this happens automatically as part of job execution.
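For example, a minimal driver sketch (the class and argument names are illustrative, reusing a mapper and reducer like the ones sketched in the other answer) only describes the job and submits it; the framework then schedules map tasks on the nodes holding the input splits and retries failed tasks on its own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperatureDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input directory with the weather data, args[1] = output directory
            Job job = Job.getInstance(new Configuration(), "max temperature 1900-1989");
            job.setJarByClass(MaxTemperatureDriver.class);
            job.setMapperClass(MaxTemperature.TemperatureMapper.class);
            job.setCombinerClass(MaxTemperature.TemperatureReducer.class);
            job.setReducerClass(MaxTemperature.TemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // waitForCompletion() submits the job to the cluster; the framework
            // decides where map and reduce tasks run and handles retries.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this into a jar and submit it once with the hadoop jar command; no per-node scripting is needed.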

For a deeper dive, please take a look at the Hadoop MapReduce tutorial.

Upvotes: 1
