Reputation: 59
I was attending a course on Hadoop and MapReduce on Udacity.com, and the instructor mentioned that in HDFS, to reduce the points of failure, each block is replicated 3 times across the cluster. Is that really true?? Does it mean that if I have 1 petabyte of logs I will need 3 petabytes of storage?? Because that will cost me more.
Upvotes: 0
Views: 1179
Reputation: 398
By default, the HDFS configuration parameter dfs.replication
is set to 3. That provides fault tolerance, availability, etc... (All HDFS parameters are documented here)
But at install time, you can set the parameter to 1, and HDFS won't make replicas of your data. With dfs.replication=1, 1 petabyte is stored in the same amount of space.
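As a sketch, the cluster-wide default would be set in hdfs-site.xml; the property name dfs.replication is standard, and the value 1 here is just for illustration:

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<property>
  <name>dfs.replication</name>
  <!-- 1 = no extra copies; the usual default is 3 -->
  <value>1</value>
</property>
```

Note that with replication 1, losing a single datanode disk means losing that data.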
Upvotes: 1
Reputation: 1809
This is because HDFS replicates data when you store it. The default replication factor for HDFS is 3, which you can find in the hdfs-site.xml file under the dfs.replication property. You can set this value to 1 or 5 as per your requirement.
Data replication is very useful: if a particular node goes down, you will have a copy of the data available on another node/nodes for processing.
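Besides the cluster-wide default, the replication factor can also be changed per file after the fact with the standard hdfs dfs -setrep command (the path below is just an example):

```shell
# Set replication to 2 for an existing file and wait for it to complete
hdfs dfs -setrep -w 2 /logs/app.log

# Check the current replication factor of that file
hdfs dfs -stat "%r" /logs/app.log
```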
Upvotes: 0
Reputation: 294407
Yes, it is true: HDFS requires space for each redundant copy, and it needs those copies to achieve fault tolerance and data locality during processing.
But this is not necessarily true of MapReduce, which can run on other file systems like S3 or Azure blobs, for instance. It is HDFS that requires the 3 copies.
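The storage overhead in the question is simply the replication factor times the logical data size; a minimal sketch of the arithmetic:

```python
def raw_storage_needed(logical_bytes: int, replication: int = 3) -> int:
    """Raw HDFS capacity needed to hold `logical_bytes` of data
    when every block is stored `replication` times."""
    return logical_bytes * replication

PETABYTE = 10 ** 15
# 1 PB of logs with the default replication factor of 3:
print(raw_storage_needed(1 * PETABYTE) // PETABYTE)  # → 3
```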
Upvotes: 1
Reputation: 37083
Yes, that's true. Say you have 4 machines with datanodes running on them; then by default each block will also be replicated to two other machines chosen at random. If you don't want that, you can switch it to 1 by setting the dfs.replication
property in hdfs-site.xml.
Upvotes: 0