snow_leopard

Reputation: 1556

Why should I avoid storing lots of small files in Hadoop HDFS?

I have read that storing lots of small files in HDFS can be a problem because lots of small files means lots of objects in the Hadoop NameNode's memory.

However, since each block is stored in the NameNode as an object, how is this different for a large file? Whether you store 1,000 blocks from a single file or 1,000 blocks from 1,000 files, is the amount of NameNode memory used the same?

A similar question for Map jobs: since they operate on blocks, why does it matter whether the blocks come from small files or from bigger ones?

Upvotes: 1

Views: 959

Answers (1)

Andrew Mo

Reputation: 1463

At a high level, you can think of a Hadoop NameNode as a tracker for where the blocks composing 'files' stored in HDFS are located; blocks are used to break large files into smaller pieces when they are stored in an HDFS cluster.

  • When you have lots of small files stored in HDFS, there are also lots of blocks, and the NameNode must keep track of all of those files and blocks in memory.

When you have larger files instead -- for example, if you combined all of those small files into bigger files first -- you would have fewer files stored in HDFS, and you would also have fewer blocks.
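As a concrete illustration, here is a minimal sketch of one way to do that consolidation with the Hadoop FileSystem Java API; the paths /data/small and /data/merged.dat are hypothetical placeholders, not anything special to Hadoop:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class MergeSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path inputDir = new Path("/data/small");        // directory full of small files (hypothetical)
            Path mergedFile = new Path("/data/merged.dat"); // single consolidated output (hypothetical)

            try (FSDataOutputStream out = fs.create(mergedFile)) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (!status.isFile()) {
                        continue; // skip subdirectories
                    }
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        // Append this small file's bytes to the consolidated file;
                        // HDFS splits the merged file into blocks as usual.
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
            fs.close();
        }
    }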

First, let's discuss how file size, HDFS blocks, and NameNode memory relate:

This is easier to see with examples and numbers.

Our HDFS cluster's block size for this example is 100 MB.

Let's pretend we have a thousand (1,000) 1 MB files and we store them in HDFS. When storing these 1,000 1 MB files in HDFS, we would also have 1,000 blocks composing those files in our HDFS cluster.

  • Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 150 KB of memory for those 1,000 blocks representing 1,000 1 MB files.

Now, consider that we consolidate or concatenate those 1,000 1 MB files into a single 1,000 MB file and store that single file in HDFS. When storing the 1,000 MB file in HDFS, it would be broken down into blocks based on our HDFS cluster block size; in this example our block size was 100 MB, which means our 1,000 MB file would be stored as ten (10) 100 MB blocks in the HDFS cluster.

  • Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 1.5 KB of memory for those 10 blocks representing the single 1,000 MB file.

With the larger file, we have the same data stored in the HDFS cluster, but we use only about 1% of the NameNode memory compared to the scenario with many small files.
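To double-check that arithmetic, here is a small back-of-the-envelope sketch in Java; the ~150 bytes per block object is the rough estimate used above, not an exact Hadoop constant:

    public class NameNodeMemoryEstimate {
        static final long BYTES_PER_BLOCK_OBJECT = 150; // rough per-block estimate from above
        static final long BLOCK_SIZE_MB = 100;          // example cluster block size

        // Number of blocks needed to store `fileCount` files of `fileSizeMb` each;
        // every file occupies at least one block, and partial blocks round up.
        static long blocksFor(long fileSizeMb, long fileCount) {
            long blocksPerFile = Math.max(1, (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB);
            return blocksPerFile * fileCount;
        }

        public static void main(String[] args) {
            long smallFileBlocks = blocksFor(1, 1000); // 1,000 x 1 MB files -> 1,000 blocks
            long bigFileBlocks = blocksFor(1000, 1);   // one 1,000 MB file  -> 10 blocks

            // ~150,000 bytes (~150 KB) vs ~1,500 bytes (~1.5 KB)
            System.out.println("small files: " + smallFileBlocks * BYTES_PER_BLOCK_OBJECT + " bytes");
            System.out.println("one big file: " + bigFileBlocks * BYTES_PER_BLOCK_OBJECT + " bytes");
        }
    }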

Input blocks and the number of Map tasks for a job are related.

When it comes to Map tasks, generally you will have one Map task per input block. The size of input blocks matters here because there is overhead from starting and finishing each task; when Map tasks finish too quickly, that overhead becomes a greater portion of each task's completion time, and the overall job can complete more slowly than the same job run over fewer, bigger input blocks. For a MapReduce2-based job, each Map task also involves starting and stopping a YARN container at the resource management layer, which adds more overhead. (Note that you can also instruct MapReduce jobs to use a minimum input split size when dealing with many small input blocks, to address some of these inefficiencies as well; see the sketch below.)
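As a rough illustration, here is a minimal sketch of a job driver that packs many small input files into larger splits. CombineTextInputFormat and FileInputFormat are real Hadoop classes, but the 128 MB cap, the class name, and the input/output paths are illustrative assumptions, and the Mapper/Reducer setup is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files-example");
            job.setJarByClass(SmallFilesJobDriver.class);

            // Pack many small input files into each split instead of one split per file/block.
            job.setInputFormatClass(CombineTextInputFormat.class);

            // Cap each combined split at roughly 128 MB (illustrative value); this sets
            // mapreduce.input.fileinputformat.split.maxsize on the job configuration.
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            // (For regular, non-combining input formats, FileInputFormat.setMinInputSplitSize(job, ...)
            //  is the "minimum input size threshold" mentioned above.)

            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of many small files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Mapper/Reducer classes omitted; this sketch only shows the input-split knobs.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }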

Upvotes: 3
