Reputation: 413
Suppose I have a Hadoop cluster in which each data block is set to have 3 copies (replication factor 3).
One day, a datanode is unplugged (assume the data stored on it is intact). HDFS then creates new copies of the blocks that were stored on that node, so each block still has 3 copies. But if the datanode is plugged back in the next day, some blocks end up with 4 copies, and HDFS has to delete 1 of the 4.
My question is: how does HDFS choose which copy to delete? Randomly? Or does it just delete the newest one (which would mean the returning datanode gets cleared)?
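For reference, this is how I look at the current replica count and the datanodes holding each block, using the standard Hadoop FileSystem API (a minimal sketch; the file path is just an example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReplicaInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/data.txt"); // example path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Target replication factor: " + status.getReplication());

        // List each block of the file and the datanodes that currently hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " has replicas on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```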
Upvotes: 1
Views: 105
Reputation: 35444
Question: But if the Datanode is repaired and starts to work again, some data blocks have 4 copies, then HDFS has to delete 1 of the 4 copies
As you mentioned, in HDFS, when a datanode goes offline, the NameNode re-creates the lost copies on other nodes to maintain the proper replication factor for the blocks.
Now, if we want to include the same (or a different) node in HDFS again, we format it
and then add it to the cluster. So there will not be over-replicated blocks in the cluster at any point in time.
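As a side note, you can watch the same over-replication handling without touching any datanode: if you lower a file's target replication (for example via FileSystem.setReplication, which is effectively what hdfs dfs -setrep does), the NameNode marks the excess replicas for deletion on its own. A minimal sketch, assuming an example path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/data.txt"); // example path

        // Raise the target: the NameNode schedules extra copies on other datanodes.
        fs.setReplication(file, (short) 4);

        // Lower it back: the now-excess replicas are marked for deletion by the
        // NameNode, the same over-replication handling that trims surplus copies
        // when a datanode rejoins the cluster.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```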
Upvotes: 1
Reputation: 895
The data in a datanode is wiped out when it crashes. That is why HDFS keeps replicas: to ensure data availability is maintained in case of a datanode failure.
Upvotes: 0