Reputation: 11178
By default the Hadoop block size is 64MB. It is suggested that each file in Hadoop be less than 64MB, so that each file fits in a single block. When a map function starts, it can then read all of the file's data from one block, without extra data transfer. This achieves locality optimization.
My question is: does this rule apply to files that can be split, e.g. most text files and CSV files?
Each map function only processes one split of a file, and the default text file splitter makes sure each split falls within one block. So I think that for a file like a CSV, even if it is larger than one block, locality optimization can still be guaranteed.
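For reference, here is my rough sketch of the split-size rule as I understand it (the formula mirrors the new-API FileInputFormat defaults; the numbers are just illustrative):

```java
// Illustrative sketch: with default settings the split size equals the block size,
// i.e. splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // dfs.blocksize (64MB, as in my question)
        long minSize = 1L;                   // mapreduce.input.fileinputformat.split.minsize default
        long maxSize = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize default
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864 == block size
    }
}
```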
Upvotes: 0
Views: 481
Reputation: 2725
It's almost always suggested that each file in HDFS be (much) larger than the block size to increase the likelihood of a process reading a full block, and to reduce NameNode memory contention.
The default block size for HDFS has been 128MB for a while now.
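If you want to confirm what your cluster is actually using, here is a minimal sketch (assuming the HDFS client configuration is on the classpath; the file path is just an example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Block size that will be used for newly created files
        System.out.println("default block size: " + fs.getDefaultBlockSize(new Path("/")));

        // Block size recorded for an existing file (hypothetical path)
        Path file = new Path("/user/data/input.csv");
        System.out.println("file block size: " + fs.getFileStatus(file).getBlockSize());
    }
}
```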
Upvotes: 0
Reputation: 926
You are right: by default each Map function processes one split of the file, whose size is one block.
But locality optimization cannot be guaranteed, because a datanode can hold more blocks of the file than it has Map slots. For example, if a cluster node stores three blocks of the file but has only two Map slots, two Mapper processes will be executed on the local node and one on a remote node, and one block of data will be transferred to the remote node over the network.
In addition, if you have a huge number of small files (smaller than the block size), you can still read a full HDFS block in one disk operation by using CombineFileInputFormat (example). This approach can significantly increase performance.
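A minimal driver sketch of that approach (assuming the Hadoop 2.x mapreduce API; the 128MB cap and the paths are just examples, and the default identity Mapper is used to keep it short):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine small files");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Pack many small files into block-sized splits instead of one split per file
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // example cap
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));

        // No custom logic needed for this sketch: the default (identity) Mapper copies records through
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```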
Upvotes: 1