Jon Andrews

Reputation: 383

How does balancer work in HDFS?

Balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.

Will that affect rack awareness?

For example, I have three machines placed in two racks, and the data is placed according to rack awareness.

What would happen if I add a new machine to the cluster and run the balancer command?

Upvotes: 0

Views: 1008

Answers (2)

Bhaskar Das

Reputation: 682

If your question is how load balancing is used: Load balancing is helpful in spreading the load equally across the free nodes when a node is loaded above its threshold level.

A cluster is considered balanced if, for each DataNode, the ratio of used space on the node to the node's total capacity (the utilization of the node) differs from the ratio of used space in the cluster to the cluster's total capacity (the utilization of the cluster) by no more than the threshold value.
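As a rough illustration of that definition, here is a minimal Python sketch. The function names, the `(used, capacity)` tuple layout, and the 10 percent default threshold are assumptions made for the example, not the balancer's actual code.

```python
def utilization(used, capacity):
    """Return utilization as a percentage of capacity."""
    return 100.0 * used / capacity


def is_balanced(nodes, threshold=10.0):
    """Check the balance definition for a list of (used, capacity) tuples.

    threshold is in percentage points; 10.0 is an assumed default here.
    """
    total_used = sum(used for used, _ in nodes)
    total_capacity = sum(capacity for _, capacity in nodes)
    cluster_util = utilization(total_used, total_capacity)
    # Every node must sit within `threshold` points of the cluster average.
    return all(
        abs(utilization(used, capacity) - cluster_util) <= threshold
        for used, capacity in nodes
    )


# Three nodes at 60% plus a newly added empty node: cluster utilization is 45%.
before = [(60, 100), (60, 100), (60, 100), (0, 100)]
# The same data spread evenly across all four nodes.
after = [(45, 100), (45, 100), (45, 100), (45, 100)]
```

With these numbers, `is_balanced(before)` is false because the empty node is 45 points below the cluster average, while `is_balanced(after)` is true. This mirrors the situation in the question, where a new machine joins an existing cluster.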

Load balancing applied at runtime is called dynamic load balancing. It can be done in either a direct or an iterative manner, depending on how the execution node is selected:

  • In the iterative methods, the final destination node is determined through several iteration steps.
  • In the direct methods, the final destination node is selected in one step.

Rack Awareness

Rack awareness prevents data loss when an entire rack fails and makes it possible to use bandwidth from multiple racks when reading a file.

On a multi-rack cluster, block replicas are placed under a policy that puts no more than one replica on any node and no more than two replicas in the same rack, with the constraint that the number of racks used for a block's replicas is always less than the total number of replicas of that block.
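The first two rules of that policy can be expressed as a small check. This is only a sketch: `placement_ok`, `move_keeps_policy`, and the `(node, rack)` pair layout are hypothetical names chosen for illustration, and the rack-count constraint is left out.

```python
from collections import Counter


def placement_ok(replicas):
    """Check one block's replicas, given as a list of (node, rack) pairs.

    Rules checked: at most one replica per node, at most two per rack.
    """
    per_node = Counter(node for node, _ in replicas)
    per_rack = Counter(rack for _, rack in replicas)
    return (all(count <= 1 for count in per_node.values())
            and all(count <= 2 for count in per_rack.values()))


def move_keeps_policy(replicas, source_node, target):
    """Hypothetical helper: would moving the replica on source_node to
    target (a (node, rack) pair) still satisfy placement_ok?"""
    moved = [r for r in replicas if r[0] != source_node] + [target]
    return placement_ok(moved)


# One replica on rack r1 and two on rack r2, each on a different node.
block = [("n1", "r1"), ("n2", "r2"), ("n3", "r2")]
```

Here `placement_ok(block)` is true. Moving the replica from `n1` to a new node `n4` on rack `r1` keeps both rules satisfied, whereas moving it to a node on rack `r2` would put three replicas in one rack and fail the check.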

For example,

  1. When a new block is created, the first replica is placed on the local node, the second on a different rack, and the third on a different node in the local rack.
  2. When re-replicating a block, if there is only one existing replica, the second is placed on a different rack.
  3. When there are two existing replicas and both are on the same rack, the third is placed on a different rack.
  4. For reading, the NameNode first checks whether the client's computer is located in the cluster. If it is, block locations on the DataNodes closest to the client are returned.

This minimizes write cost and maximizes read speed.

Upvotes: 0

OneCricketeer

Reputation: 191681

Rack awareness and data locality are YARN concepts. The HDFS balancer only cares about leveling out DataNode usage.

If you have 3 machines, with 3 replicas by default, then every machine could be guaranteed to have 1 replica, therefore with 2 racks, you're practically guaranteed to have rack locality.

Node locality is more performant than rack awareness, anyway.

If you have 10 GB intra-cluster speeds between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data-local processing is not available.

Upvotes: 1
