Reputation: 1928
I wrote a simple program that writes data to HDFS. I set dfs.replication to 3 via the Configuration object and ran the program against a pseudo-distributed HDFS cluster. I expected an exception, since the replication factor is 3 and there is only one datanode, but the program finished successfully.
Is there a way to tell that my data is in an under-replicated state? I think this relates to dfs.replication.min, but when I changed it to 3 as well (in the program's Configuration object), writes still succeeded and I didn't get any exceptions.
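Roughly, my program looks like this (the namenode URI and file path are placeholders for my local setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder namenode URI
        conf.set("dfs.replication", "3");                  // ask for 3 replicas

        FileSystem fs = FileSystem.get(conf);
        // Write a small file; the path is a placeholder
        try (FSDataOutputStream out = fs.create(new Path("/tmp/replication-test"))) {
            out.writeUTF("some test data");
        }
        // Finishes without an exception even though there is only one datanode
    }
}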
Upvotes: 3
Views: 4120
Reputation: 41
If you want to force synchronous HDFS replication, use this command: hadoop fs -setrep [-R] [-w] <numReplicas> <path>. It sets the replication factor of a file. The -w flag makes the command wait until the replication is complete, and the -R flag requests a recursive change for an entire tree, e.g. hadoop fs -setrep -w 3 /my/file.
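The same change can also be made from Java through the FileSystem API (a sketch; the path is a placeholder, and note that unlike -w this call only schedules the change and does not wait for it to finish):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Change the replication factor of an existing file;
        // returns true if the namenode accepted the change.
        boolean ok = fs.setReplication(new Path("/my/file"), (short) 3);
        System.out.println("Replication change accepted: " + ok);
    }
}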
Upvotes: 3
Reputation: 1928
I've run some tests and found the reason.
First, when you create a file, its replication factor must be greater than or equal to dfs.replication.min. HDFS guarantees that replication to dfs.replication.min nodes is synchronous; replication to the remaining nodes (dfs.replication - dfs.replication.min) happens asynchronously.
Since the default setting for dfs.replication.min is 1, I could successfully write a file with dfs.replication = 3 to an HDFS cluster of one node.
The default replication factor (dfs.replication) is 3, but it can be changed per request via the Configuration object. The sad part is that dfs.replication.min can't be changed per request: it's a cluster-wide namenode setting, so you can't strengthen the guarantee for a single write if the cluster was configured with a lower value.
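As for my original question about detecting the under-replicated state: one way from code (a sketch; the path is a placeholder) is to compare the replication factor recorded for the file with the number of datanodes that actually hold each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UnderReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/replication-test"));

        short requested = status.getReplication(); // factor recorded for the file
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            int actual = block.getHosts().length;  // datanodes actually holding this block
            if (actual < requested) {
                System.out.println("Block at offset " + block.getOffset()
                        + " is under-replicated: " + actual + "/" + requested);
            }
        }
    }
}

Running hadoop fsck on the file path also reports under-replicated blocks from the command line.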
Upvotes: 5