Chen Guo

Reputation: 23

How is a Hadoop job run on various nodes?

I am new to Hadoop, so I may ask dumb questions.

I have three Hadoop slave nodes, all of which hold weather data.

Say I have a MapReduce job to find the highest temperature from 1900 to 1989.

My question is:

When we submit the MapReduce job, will Hadoop automatically run it on those three nodes, or do we need to write a script to do so?

Thanks for your patience and your answers.

Upvotes: 0

Views: 480

Answers (2)

OneCricketeer

Reputation: 191701

Data is not sharded by date ranges or keys when it is inserted into HDFS; it is distributed evenly across all nodes (and replicated) in units of HDFS blocks. Even if replication were set to 1, parts of the first date range could exist on all three nodes while the last range might sit on only one node.

The HDFS protocols decide where to place blocks, not your external application acting as a client.

MapReduce (or Spark) will process "input splits" wherever they exist (which could easily be just one node), and this is automatic. Your code is deployed to each node (assuming YARN is used) to read the data on the DataNode itself (if that is how the NodeManagers are installed), and results are then collected back to a driver process, which can optionally write output to your local terminal.
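As a rough illustration of the code that gets shipped to each node, here is a minimal sketch of a map and reduce pair for the max-temperature problem. It assumes a hypothetical input format of one "year,temperature" record per line; the class and field names are invented for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperature {

        // Runs on whichever nodes hold the input splits; emits (year, temperature).
        public static class TemperatureMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical record format: "year,temperature", e.g. "1901,-3"
                String[] parts = line.toString().split(",");
                String year = parts[0].trim();
                int temperature = Integer.parseInt(parts[1].trim());
                context.write(new Text(year), new IntWritable(temperature));
            }
        }

        // Receives all temperatures for a given year and keeps the maximum.
        public static class TemperatureReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable t : temps) {
                    max = Math.max(max, t.get());
                }
                context.write(year, new IntWritable(max));
            }
        }
    }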

Upvotes: 0

Jagrut Sharma

Reputation: 4754

HDFS is a distributed file system, so the weather data will automatically be distributed among the 3 slave nodes. By default, it will be replicated 3 times. Node 1, Node 2, and Node 3 may all hold pieces of data from all 3 time frames (1900-1929, 1930-1959, 1960-1989). This distribution and replication happen automatically when the data is uploaded to HDFS. There is a master node called the NameNode that keeps the metadata describing which file blocks reside on which nodes.
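If you want to see that mapping yourself, here is a small sketch against the HDFS FileSystem API (the file path is hypothetical; point it at wherever your data lives) that asks the NameNode which datanodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
        public static void main(String[] args) throws Exception {
            // Hypothetical path; replace with the actual location of the weather data.
            Path file = new Path("/data/weather/1900-1929.txt");

            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(file);

            // The NameNode's metadata tells us which hosts store each block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }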

MapReduce is a distributed data processing framework. A MapReduce job submitted to the cluster will automatically be distributed across the 3 nodes. Map and reduce tasks will run on the nodes, leveraging data locality as much as possible: each node will try to process the data stored on it whenever it can. If there are task failures, the tasks will be retried up to a certain number of times. All of this happens automatically as part of job execution.
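For example, a minimal driver sketch (the class and argument names are illustrative, reusing a mapper and reducer like the ones sketched in the other answer) only describes the job and submits it; the framework then schedules map tasks on the nodes holding the input splits and retries failed tasks on its own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperatureDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input directory with the weather data, args[1] = output directory
            Job job = Job.getInstance(new Configuration(), "max temperature 1900-1989");
            job.setJarByClass(MaxTemperatureDriver.class);
            job.setMapperClass(MaxTemperature.TemperatureMapper.class);
            job.setCombinerClass(MaxTemperature.TemperatureReducer.class);
            job.setReducerClass(MaxTemperature.TemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // waitForCompletion() submits the job to the cluster; the framework
            // decides where map and reduce tasks run and handles retries.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this into a jar and submit it once with the hadoop jar command; no per-node scripting is needed.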

For a deeper dive, please take a look at the Hadoop MapReduce tutorial.

Upvotes: 1
