Tom Sebastian

Reputation: 3433

No. of files Vs No. of blocks in HDFS

I am running a single-node Hadoop environment. When I ran $ hadoop fsck /user/root/mydatadir -blocks, I was confused by the output it gave:


Status: HEALTHY
 Total size:    998562090 B
 Total dirs:    1
 Total files:   50 (Files currently being written: 1)
 Total blocks (validated):      36 (avg. block size 27737835 B) (Total open file blocks (not validated): 1)
 Minimally replicated blocks:   36 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       36 (100.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     1.0
 Corrupt blocks:                0
 Missing replicas:              72 (200.0 %)
 Number of data-nodes:          1
 Number of racks:               1

It says I have written 50 files, yet they use only 36 blocks (ignoring the file currently being written).

From my understanding, each file uses at least 1 block even if its size is less than the HDFS block size (for me it's 64 MB, the default size), i.e., I expect 50 blocks for 50 files. What is wrong with my understanding?

Upvotes: 2

Views: 2766

Answers (1)

WestCoastProjects

Reputation: 63062

Files do not each require a full block. The concern with small files is the overhead of managing them and, if you truly have a great many of them, namenode memory utilization:

From Hadoop - The Definitive Guide:

Small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB. Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
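To connect this to the fsck output in the question, here is a minimal Python sketch of how block counts follow from file sizes. It assumes the 64 MB block size stated in the question, and the file sizes are hypothetical: 36 non-empty files and 14 empty ones, chosen only to match the reported totals. A file smaller than the block size still gets one block, while a zero-length file gets none.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size stated in the question


def blocks_for(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks one file occupies; an empty file occupies none."""
    return math.ceil(file_size / block_size)


# Hypothetical sizes (an assumption, not taken from the real directory):
# 36 non-empty files below the block size plus 14 empty files.
sizes = [27_000_000] * 36 + [0] * 14
total_blocks = sum(blocks_for(s) for s in sizes)
print(len(sizes), total_blocks)  # 50 36

# The average block size fsck reports is total size divided by block count.
print(998562090 // 36)  # 27737835
```

If some of the 50 files are empty (job marker files, for example), that would account for the gap. Running hadoop fsck /user/root/mydatadir -files -blocks lists the blocks for each file, so this can be checked directly.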

However, a single block holds data from only one file, unless small files are packed together with a container such as HAR or SequenceFile. CombineFileInputFormat helps on the read side by grouping many small files into a single input split. Here is some more information: Small File problem info
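The namenode concern can be made concrete with a rough estimate. The sketch below assumes the commonly cited rule of thumb of about 150 bytes of namenode memory per file, directory, or block. That figure is an approximation rather than a measurement, and the function name and parameters are made up for illustration.

```python
# Rule-of-thumb assumption: roughly 150 bytes of namenode heap per
# file, directory, or block object. Real usage varies by version and config.
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate namenode memory for a set of files (hypothetical helper)."""
    objects = num_files + num_dirs + num_files * blocks_per_file
    return objects * BYTES_PER_OBJECT


# One million small files, each occupying a single block.
print(namenode_bytes(1_000_000) / 1e6, "MB")  # 300.0 MB
```

Under this assumption the cost depends on the number of files and blocks, not on how many bytes they hold, which is why packing many small files into fewer large ones reduces namenode load.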

Upvotes: 1
