Praveen Gr

Reputation: 197

In Hadoop, how does the NameNode get the rack details and learn which DataNodes belong to each rack?

Assume a Hadoop cluster has two racks, rck1 and rck2, and each rack has 5 nodes. How does the NameNode come to know that node 1 belongs to rack 1 and node 3 belongs to rack 2?

Upvotes: 0

Views: 569

Answers (1)

mc110

Reputation: 2833

You must configure the system to specify how the rack information is determined. For example, this Cloudera link tells you how to configure the racks for hosts in Cloudera Manager.

Alternatively, this Apache link explains how this information can be supplied by an external script or a Java class, selected via the configuration files.

The topology is typically of the form /myrack/myhost, although you can use a deeper hierarchy. The Apache documentation gives the following example in Python, which assumes a /24 subnet for each rack and therefore uses the first three bytes of the IP address as the rack identifier. You could adopt a similar approach if you can assign the node IP addresses accordingly, or write your own script to determine the rack from the IP address or from other information available on each node. Even a simple hard-coded mapping between hostname and rack would work in your example, which has relatively few nodes.

#!/usr/bin/python
# this script makes assumptions about the physical environment.
#  1) each rack is its own layer 3 network with a /24 subnet, which
#     could be typical where each rack has its own
#     switch with uplinks to a central core router.
#
#             +-----------+
#             |core router|
#             +-----------+
#            /             \
#   +-----------+        +-----------+
#   |rack switch|        |rack switch|
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#
#  2) topology script gets list of IP's as input, calculates network address, and prints '/network_address'.

import netaddr
import sys

sys.argv.pop(0)                                                  # discard name of topology script from argv list as we just want IP addresses

netmask = '255.255.255.0'                                        # set netmask to what's being used in your environment.  The example uses a /24

for ip in sys.argv:                                              # loop over list of datanode IP's
    address = '{0}/{1}'.format(ip, netmask)                      # format address string so it looks like 'ip/netmask' to make netaddr work
    try:
        network_address = netaddr.IPNetwork(address).network     # calculate and print network address
        print("/{0}".format(network_address))
    except Exception:
        print("/rack-unknown")                                   # print catch-all value if unable to calculate network address
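
For a cluster as small as the one in the question, the hard-coded mapping mentioned above might look something like the sketch below. The host names (node1 to node4) and the table itself are hypothetical placeholders, not taken from your cluster. The NameNode may pass IP addresses rather than host names, so the keys need to match whatever your script actually receives. Unknown hosts fall back to /default-rack, which is the rack Hadoop itself assigns when no topology script is configured.

```python
#!/usr/bin/env python
# Minimal sketch of a hard-coded topology script.
# Assumption: the host names below are placeholders for the cluster in the
# question; replace them with the names or IPs Hadoop passes to the script.
import sys

# Hypothetical host-to-rack table (extend to all 10 nodes as needed).
RACK_BY_HOST = {
    "node1": "/rck1",
    "node2": "/rck1",
    "node3": "/rck2",
    "node4": "/rck2",
}

# Rack reported for any host that is missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, or the default rack if unknown."""
    return RACK_BY_HOST.get(host, DEFAULT_RACK)


def main(args):
    # Hadoop passes one or more hosts as arguments and expects
    # one rack path printed per argument, in the same order.
    for host in args:
        print(resolve_rack(host))


if __name__ == "__main__":
    main(sys.argv[1:])
```

The script would then be referenced from core-site.xml through the net.topology.script.file.name property (topology.script.file.name in older releases) and must be executable by the user running the NameNode. Running it by hand, e.g. with node1 and node3 as arguments, should print /rck1 and /rck2 on separate lines.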

Upvotes: 2
