Reputation: 387
Consider a scenario: if I increase the replication factor of the data I have in HDFS, say in a 10-node cluster I set RF = 5 instead of the default 3, will it improve the performance of my data processing tasks?
Will the map phase complete sooner compared to the default replication setting?
Will there be any effect on the reduce phase?
Upvotes: 0
Views: 396
Reputation: 6343
Impact of Replication on Storage:
The larger the replication factor, the fewer files you can store in the cluster.
[...] HADOOP_NAMENODE_OPTS).
Impact of Replication on Computation:
Mapper:
With more replicas of each block, there is a better chance that a mapper can be scheduled on a node that already holds a local copy of its input. Because of this, it's possible that, with a higher replication factor, the mappers complete earlier than with a lower one.
Since the number of mappers is typically higher than the number of reducers, you may see an overall improvement in your job performance.
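As a rough illustration of why more replicas can help the map phase, the sketch below estimates the chance that one particular node already holds a copy of a given block. It assumes replicas sit on distinct nodes chosen uniformly at random, which ignores HDFS's rack-aware placement and the scheduler's behaviour, and the function name is made up for this example.

```python
def local_replica_probability(num_nodes: int, replication_factor: int) -> float:
    """Chance that one particular node holds a replica of a given block.

    Assumption: replicas sit on distinct nodes chosen uniformly at random.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    # A block cannot have more replicas than there are nodes.
    replicas = min(replication_factor, num_nodes)
    return replicas / num_nodes


for rf in (3, 5):
    chance = local_replica_probability(10, rf)
    print(f"RF={rf}: {chance:.0%} chance a given node has the block locally")
```

Under that assumption, going from RF = 3 to RF = 5 on a 10-node cluster raises the figure from 30% to 50%. Whether that turns into faster mappers still depends on how busy the nodes are.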
Reducer:
Overall, your mappers may execute faster with a higher replication factor. But the actual performance improvement depends on various factors, such as the size of your cluster, bandwidth, NameNode memory, etc.
After answering this question, I came across a similar question on SO: Map Job Performance on cluster. It contains some more information, with links to various research papers.
Upvotes: 1
Reputation: 5118
Setting the replication factor to 5 will cause the HDFS NameNode to maintain 5 total copies of each file block on the available DataNodes in the cluster. This copy operation, which the NameNode schedules and the DataNodes carry out among themselves, will result in higher network bandwidth usage, depending on the size of the files to be replicated and the speed of your network.
The replication factor has no direct effect on either the map or the reduce phase. You may see a performance hit initially if blocks are still being replicated while a map-reduce job is running; this could cause significant network latency, depending on the size of the files and your network bandwidth.
A replication factor of 5 across your cluster means that 4 of your data nodes can disappear from your cluster and you'll still have enough nodes to access all files in HDFS with no file corruption or missing blocks. If your RF = 4, you can lose 3 servers and still have access to all files in HDFS.
Setting a higher replication factor increases your overall HDFS usage: if your total data size is 1 TB, RF = 3 means your HDFS usage will be 3 TB, since each block is duplicated n-1 (3-1 = 2) times across the cluster.
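The arithmetic in this answer can be sketched in a few lines of Python. The helper names are hypothetical, and the calculation ignores block-size rounding, compression, and any non-HDFS disk overhead.

```python
def raw_hdfs_usage_tb(logical_size_tb: float, replication_factor: int) -> float:
    """Raw disk space consumed when every block is stored replication_factor times."""
    return logical_size_tb * replication_factor


def tolerated_node_losses(replication_factor: int) -> int:
    """Number of nodes holding a block that can fail while one copy still survives."""
    return replication_factor - 1


for rf in (3, 4, 5):
    print(f"RF={rf}: 1 TB of data uses {raw_hdfs_usage_tb(1, rf):g} TB raw "
          f"and survives the loss of {tolerated_node_losses(rf)} nodes")
```

So moving 1 TB of data from RF = 3 to RF = 5 costs an extra 2 TB of raw capacity in exchange for tolerating two more node failures.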
Upvotes: 0