Reputation: 197
Assume a Hadoop cluster has 2 racks, rck1 and rck2, each with 5 nodes. How does the NameNode know that node 1 belongs to rack 1 and node 3 belongs to rack 2?
Upvotes: 0
Views: 569
Reputation: 2833
You must configure the system to specify how the rack information is determined. For example, this Cloudera link tells you how to configure the racks for hosts in Cloudera Manager.
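Alternatively, this Apache link explains how this information can be supplied by an external script or a Java class, selected via the configuration files (the net.topology.script.file.name and net.topology.node.switch.mapping.impl properties, respectively).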
The topology path is typically of the form /myrack/myhost, although you can use a deeper hierarchy. The Apache documentation gives the following Python example, which assumes each rack has its own /24 subnet and therefore uses the network address (the first three bytes of the IP address) as the rack identifier. You could adopt a similar approach if you can assign the node IP addresses accordingly, or write your own script that determines the rack from the IP address or from other information available on each node. Even a simple hard-coded mapping between hostname and rack would work in your example, which has relatively few nodes.
#!/usr/bin/python
# this script makes assumptions about the physical environment.
# 1) each rack is its own layer 3 network with a /24 subnet, which
# could be typical where each rack has its own
# switch with uplinks to a central core router.
#
#             +-----------+
#             |core router|
#             +-----------+
#            /             \
#   +-----------+        +-----------+
#   |rack switch|        |rack switch|
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#
# 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address'.
import netaddr
import sys

sys.argv.pop(0)  # discard name of topology script from argv list as we just want IP addresses

netmask = '255.255.255.0'  # set netmask to what's being used in your environment. The example uses a /24

for ip in sys.argv:  # loop over list of datanode IP's
    address = '{0}/{1}'.format(ip, netmask)  # format address string so it looks like 'ip/netmask' to make netaddr work
    try:
        network_address = netaddr.IPNetwork(address).network  # calculate and print network address
        print("/{0}".format(network_address))
    except Exception:
        print("/rack-unknown")  # print catch-all value if unable to calculate network address
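For a cluster as small as the one in the question, the hard-coded mapping mentioned above could look something like the sketch below. The host names (node1, node3, and so on) and the rack paths are hypothetical placeholders, not values taken from a real cluster, and the keys would need to match whatever the NameNode actually passes to the script (often IP addresses rather than host names). The script prints one rack path per argument, which is the same contract the Apache example follows.

```python
#!/usr/bin/env python3
# Sketch of a hard-coded topology script.
# ASSUMPTION: the host names and rack paths below are illustrative only;
# replace the keys with the host names or IP addresses used in your cluster.
import sys

RACK_BY_HOST = {
    "node1": "/rck1",
    "node2": "/rck1",
    "node3": "/rck2",
    "node4": "/rck2",
}

# Fallback for hosts missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, or the fallback if it is unknown."""
    return RACK_BY_HOST.get(host, DEFAULT_RACK)


def main(argv):
    # The script may receive several hosts at once and is expected to
    # print one rack path per argument, in the same order.
    for host in argv[1:]:
        print(resolve_rack(host))


if __name__ == "__main__":
    main(sys.argv)
```

A lookup table like this is easy to audit but has to be edited whenever nodes are added or moved, which is why the subnet-based approach tends to scale better on larger clusters.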
Upvotes: 2