Hadoop terminology mapping to hardware

Question

I am starting out with Hadoop and trying to implement a Hadoop cluster. I am new to distributed systems, so I am a bit confused by the terminology.

Do namenodes and datanodes correspond to physical hard disks?
If, say, I need to run map tasks on a single CPU, do we assign map tasks to individual cores or to processors (with multiple cores) when setting the mapred.tasktracker.map.tasks.maximum flag?
What does a "node" imply: a processor, a physical disk, or a core?

Chris White · Accepted Answer

Firstly (on a terminology front), I assume you mean instantiate a Hadoop cluster rather than implement one.

A namenode manages one or more datanodes. The namenode keeps the index of file names to block IDs in memory and periodically flushes it to disk. The datanodes report the actual locations of the blocks to the namenode, which from that point manages the assignment, migration, replication and removal of blocks.
A datanode manages the storage of blocks on physical hard disks. A datanode can distribute its blocks over one or more physical disks (in fact, you're encouraged to use multiple physical disks rather than a single logical volume of disks).
The Job Tracker (JT) manages the assignment of tasks (either map or reduce) to one or more Task Trackers (TT). Typically you will configure each node (physical machine) in your cluster so that the maximum number of tasks that can run at once (map / reduce) matches the number of cores. This is not a hard and fast rule; it depends on how you expect to use the cluster.
A node typically means a physical machine, which usually runs a Task Tracker (which runs map / reduce tasks) and a datanode (storing / serving up file blocks).
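
To make the "slots match cores" guideline concrete, here is a minimal Python sketch that splits a node's core count into map and reduce slots and renders the two Task Tracker properties. The two-thirds map / one-third reduce split, the helper names (suggest_slots, slot_properties, to_property_xml), and the idea of pasting the output into mapred-site.xml are assumptions made for illustration, not something the cluster requires; tune the numbers for your own workload.

```python
from xml.sax.saxutils import escape


def suggest_slots(cores):
    """Split a node's cores into (map_slots, reduce_slots).

    Heuristic assumed for this sketch: roughly two thirds of the cores go to
    map tasks and the rest to reduce tasks, with at least one slot of each.
    """
    if cores < 1:
        raise ValueError("a node needs at least one core")
    map_slots = max(1, (2 * cores) // 3)
    # On a single-core node this oversubscribes the core (1 map + 1 reduce).
    reduce_slots = max(1, cores - map_slots)
    return map_slots, reduce_slots


def slot_properties(cores):
    """Return the two Task Tracker slot properties as a name -> value dict."""
    map_slots, reduce_slots = suggest_slots(cores)
    return {
        "mapred.tasktracker.map.tasks.maximum": str(map_slots),
        "mapred.tasktracker.reduce.tasks.maximum": str(reduce_slots),
    }


def to_property_xml(properties):
    """Render the properties as <property> elements for mapred-site.xml."""
    blocks = []
    for name, value in properties.items():
        blocks.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(value)}</value>\n"
            "</property>"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Example: a machine with 8 cores.
    print(to_property_xml(slot_properties(8)))
```

The point of the sketch is that the setting is per node (per Task Tracker), so the value is derived from the total core count of the machine rather than set per core or per processor.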
