CodeReaper

Reputation: 387

Hadoop Optimization Suggestion

Consider a scenario: if I increase the replication factor of the data I have in HDFS (suppose in a 10-node cluster I set the RF to 5 instead of the default 3), will it improve the performance of my data processing tasks?

Will the map phase complete sooner compared to the default replication setting?

Will there be any effect on the reduce phase?
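For context, this is roughly how I would raise the replication factor on the existing data (a rough sketch using the FileSystem API; the path is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor of an existing file from 3 to 5.
            // The path below is just an example.
            fs.setReplication(new Path("/data/input/part-00000"), (short) 5);

            // Files written later pick up dfs.replication from the client
            // configuration, so that would also need to be set to 5.
            fs.close();
        }
    }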

Upvotes: 0

Views: 396

Answers (2)

Manjunath Ballur

Reputation: 6343

Impact of Replication on Storage:

  • The replication factor has a huge impact on the storage of the cluster: the larger the replication factor, the less data you can store in the cluster.
  • If the replication factor is 5, then for every 1 GB of data ingested into the cluster you will need 5 GB of storage space, and you will quickly run out of space in the cluster (see the arithmetic sketch after this list).
  • Since the NameNode stores all the meta information in memory, it will also run out of space to store the metadata sooner. Hence, your NameNode will have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
  • The data copy operation will take more time, since the write pipeline is daisy-chained across DataNodes. Instead of 3 DataNodes, now 5 DataNodes will have to confirm data storage before a write/append is committed.
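To make the storage overhead concrete, here is a quick arithmetic sketch (the dataset size is only an example):

    public class ReplicationFootprint {
        public static void main(String[] args) {
            long ingestedGb = 100;   // example: 100 GB of logical data
            int defaultRf = 3;
            int proposedRf = 5;

            // Raw HDFS space consumed is roughly logical size * replication factor.
            System.out.println("Raw space at RF=3: " + ingestedGb * defaultRf + " GB");
            System.out.println("Raw space at RF=5: " + ingestedGb * proposedRf + " GB");
            System.out.println("Extra space needed: "
                    + ingestedGb * (proposedRf - defaultRf) + " GB");
        }
    }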

Impact of Replication on Computation:

Mapper:

  • With a higher replication factor, there are more options for scheduling a mapper. With a replication factor of 3, you can schedule a mapper on 3 different nodes; with a factor of 5, you have 5 choices.
  • You may be able to achieve better data locality with an increase in the replication factor. Each mapper could get scheduled on a node where its data is present (since there are now 5 choices compared to the default 3), thus improving performance.
  • Since there is better data locality, fewer mappers will have to copy off-node or off-rack data.

Due to these reasons, it's possible that, with a higher replication factor, the mappers complete earlier than with a lower replication factor.

Since the number of mappers is typically much higher than the number of reducers, you may see an overall improvement in your job performance.
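If you want to verify whether locality actually improves after raising the replication factor, one option is to compare the job's locality counters before and after the change. A minimal sketch, assuming the standard Cluster/JobCounter APIs (the job ID is passed in, since it is specific to your cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.JobID;

    public class LocalityCheck {
        public static void main(String[] args) throws Exception {
            Cluster cluster = new Cluster(new Configuration());

            // args[0] is the ID of a completed job, e.g. "job_1234567890123_0001".
            Job job = cluster.getJob(JobID.forName(args[0]));
            Counters counters = job.getCounters();

            long dataLocal  = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal  = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            long otherLocal = counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue();

            // With better locality, DATA_LOCAL_MAPS should grow relative to the others.
            System.out.println("Data-local maps:  " + dataLocal);
            System.out.println("Rack-local maps:  " + rackLocal);
            System.out.println("Other-local maps: " + otherLocal);
            cluster.close();
        }
    }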

Reducer:

  • Since the output of the reducers is written directly into HDFS, it's possible that your reducers will take more time to execute with a higher replication factor, because each output block now has to be acknowledged by more DataNodes. One common mitigation is sketched below.
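If the extra write cost matters, one common workaround is to keep the cluster-wide default but write the job's output with a lower replication factor, by setting dfs.replication in the job configuration. A minimal driver sketch; the mapper/reducer setup is omitted and the paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LowReplicationOutputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Files written by this job (the reducer output) get replication 2,
            // regardless of the cluster-wide default.
            conf.setInt("dfs.replication", 2);

            Job job = Job.getInstance(conf, "low-replication-output");
            job.setJarByClass(LowReplicationOutputJob.class);
            // Mapper, reducer and key/value classes would be set here as usual.

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }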

Overall, your mappers may execute faster with a higher replication factor, but the actual performance improvement depends on various factors like the size of your cluster, network bandwidth, NameNode memory, etc.

After answering this question, I came across another similar question on SO: Map Job Performance on cluster. That question contains some more information, with links to various research papers.

Upvotes: 1

Okezie

Reputation: 5118

Setting the replication factor to 5 will cause the HDFS NameNode to maintain 5 copies of each file block on the available DataNodes in the cluster. This replication, coordinated by the NameNode and carried out by the DataNodes, will result in higher network bandwidth usage, depending on the size of the files to be replicated and the speed of your network.

The replication factor has no direct effect on either the map or the reduce phase. You may see a performance hit initially, while blocks are still being replicated, if you run a map-reduce job at the same time; this could cause significant network latency depending on the size of the files and your network bandwidth.

A replication factor of 5 across your cluster means that 4 of your DataNodes can disappear from the cluster and you will still have access to all files in HDFS with no file corruption or missing blocks. If your RF = 4, then you can lose 3 servers and still have access to all files in HDFS.

Setting a higher replication factor increases your overall HDFS usage: if your total data size is 1 TB, an RF of 3 means your HDFS usage will be 3 TB, since each block is stored 3 times (the original plus 2 duplicates) across the cluster.
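If you want to see this overhead on a live cluster, you can compare a directory's logical size with the raw space consumed by all of its replicas. A small sketch using getContentSummary (the path is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUsage {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Example path; point this at the directory you care about.
            ContentSummary summary = fs.getContentSummary(new Path("/data"));
            long logicalBytes = summary.getLength();         // the data itself
            long rawBytes     = summary.getSpaceConsumed();  // including all replicas

            System.out.println("Logical size: " + logicalBytes + " bytes");
            System.out.println("Raw usage:    " + rawBytes + " bytes");
            System.out.println("Effective replication factor: "
                    + (double) rawBytes / logicalBytes);
            fs.close();
        }
    }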

Upvotes: 0
