Reputation: 3433
I am running a single-node Hadoop environment. When I ran $ hadoop fsck /user/root/mydatadir -blocks, I was really confused by the output it gave:
Status: HEALTHY
Total size: 998562090 B
Total dirs: 1
Total files: 50 (Files currently being written: 1)
Total blocks (validated): 36 (avg. block size 27737835 B) (Total open file blocks (not validated): 1)
Minimally replicated blocks: 36 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 36 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 72 (200.0 %)
Number of data-nodes: 1
Number of racks: 1
It says I have written 50 files, and yet it only uses 36 blocks (I am ignoring the file currently being written).
From my understanding, each file uses at least 1 block even if its size is less than the HDFS block size (for me it's 64 MB, the default), i.e. I expect 50 blocks for 50 files. What is wrong with my understanding?
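If it helps, I believe the per-file breakdown can also be printed with fsck's -files and -blocks options on the same directory, which shows how many blocks each individual file occupies:

# list every file under the directory together with its blocks
$ hadoop fsck /user/root/mydatadir -files -blocks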
Upvotes: 2
Views: 2766
Reputation: 63062
The files do not each require a full block. The concern with small files is the overhead of managing them and, if you have a truly large number of them, NameNode memory utilization:
From Hadoop - The Definitive Guide:
Small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB. Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
However, a single HDFS block only holds data from a single file, unless the small files are packed into a container format such as a HAR or SequenceFile, or combined at read time with an input format like CombineFileInputFormat. Here is some more information: Small File problem info
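As a rough sketch of the HAR approach (the archive name and destination below are just made-up examples), you can pack a directory of small files with the hadoop archive tool and then read it back through the har:// scheme. Note that creating the archive launches a MapReduce job and leaves the original files in place:

# pack /user/root/mydatadir into /user/root/mydata.har
$ hadoop archive -archiveName mydata.har -p /user/root mydatadir /user/root
# list the archived files transparently through the har:// filesystem
$ hadoop fs -ls har:///user/root/mydata.har/mydatadir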
Upvotes: 1