Yahia

Reputation: 1339

What are the possible reasons behind the imbalance of files stored on HDFS?

Sometimes, data blocks end up distributed in an imbalanced way across the DataNodes. Under the default HDFS block placement policy, the first replica is preferably stored on the writer node (i.e. the client node), the second replica on a node in a remote rack, and the third on a different node in that same remote rack. What use cases make the data blocks unbalanced across the DataNodes under this placement policy? One possible reason that comes to mind: if there are only a few writer nodes, one replica of every data block will be stored on those nodes. Are there any other reasons?
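
For reference, here is one way to see where the replicas of an existing file actually landed (the path is just a placeholder):

    # Lists every block of the file and the DataNodes holding each replica.
    hdfs fsck /user/yahia/somefile.txt -files -blocks -locations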

Upvotes: 2

Views: 1196

Answers (1)

user674469

Reputation:

Here are some potential reasons for an unbalanced cluster:

  • If some DataNodes are unavailable for a while (not accepting requests/writes), the cluster can end up unbalanced, because new blocks pile up on the nodes that stayed online.
  • If TaskTrackers are not collocated with DataNodes evenly across the cluster, writing data through MapReduce tends to unbalance it, because nodes hosting both a TaskTracker and a DataNode are preferred as the writer-local target for the first replica.
  • The same applies to HBase RegionServers that are not collocated with DataNodes evenly.
  • Deleting a large amount of data can leave the cluster unbalanced, depending on where the deleted blocks were located.
  • Adding new DataNodes will not automatically rebalance existing blocks across the cluster.
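
You can confirm the skew before doing anything about it; the per-node "DFS Used%" figures in the report make it obvious which DataNodes are over- or under-utilized:

    # Prints capacity, DFS Used and DFS Used% for every live DataNode.
    hdfs dfsadmin -report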

The "hdfs balancer" command allows admins to rebalance the cluster. Also, https://issues.apache.org/jira/browse/HDFS-1804 added a new block storage policy that takes into account free space left on the volume.

Upvotes: 5
