Reputation: 1
I have a single-node Hadoop cluster, version 2.x, with the block size set to 64 MB. I have an input file in HDFS of size 84 MB. When I run the MR job, I see that there are 2 splits, which is valid since 84 MB / 64 MB ≈ 2, so 2 splits.
But when I run "hadoop fsck -blocks" to see the block details, I see this:
Total size: 90984182 B
Total dirs: 16
Total files: 7
Total symlinks: 0
Total blocks (validated): 7 (avg. block size 12997740 B)
Minimally replicated blocks: 7 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
As you can see, the average block size is close to 13 MB. Why is this? Ideally, the block size should be 64 MB, right?
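For what it's worth, the 13 MB figure seems to be just the total size divided by the total block count, so it averages over all 7 files, not just my 84 MB input:
90984182 B / 7 blocks ≈ 12997740 B ≈ 13 MB per block
But I still expected each block of the 84 MB file itself to be 64 MB.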
Upvotes: 0
Views: 562
Reputation: 11
The maximum block size is 64 MB as you specified, but you'd have to be pretty lucky for your average block size to equal the maximum block size.
Consider the one file you mentioned:
1 file, 84 MB
84 MB / 64 MB → 2 blocks (one full 64 MB block plus one 20 MB block)
84 MB / 2 blocks = 42 MB per block on average
You must have some other files bringing the average down even more.
Other than the memory the namenode needs to track each block, and a possible loss of parallelism if your block size is too high (obviously not an issue in a single-node cluster), there isn't much of a problem with the average block size being smaller than the maximum.
Having a 64 MB maximum block size does not mean every block takes up 64 MB on disk.
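To make the arithmetic concrete, here is a minimal sketch in plain Java (no Hadoop APIs; the class name is just for illustration) of how one 84 MB file ends up as two blocks that average 42 MB:

public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // configured HDFS block size: 64 MB
        long fileSize  = 84L * 1024 * 1024;  // the 84 MB input file

        // Every block is blockSize bytes except the last, which only holds the remainder.
        long fullBlocks  = fileSize / blockSize;                    // 1
        long remainder   = fileSize % blockSize;                    // 20 MB
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);    // 2

        System.out.println("blocks  = " + totalBlocks);                                    // 2
        System.out.println("last    = " + remainder / (1024 * 1024) + " MB");              // 20
        System.out.println("average = " + fileSize / totalBlocks / (1024 * 1024) + " MB"); // 42
    }
}

So even a single file already drags the average well below the 64 MB maximum; the other, smaller files in your HDFS then pull it down further, to the ~13 MB that fsck reports.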
Upvotes: 1
Reputation: 166
When you configure the block size, you set the maximum size a block can be. It is highly unlikely that your files are an exact multiple of the block size, so the last block of each file will usually be smaller than the configured block size.
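If you want to verify this per file, here is a rough sketch against the HDFS Java client API (the class name and the path you pass in are placeholders) that prints the real length of every block of a file. For an 84 MB file with a 64 MB block size you should see one 64 MB block followed by one roughly 20 MB block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLengths {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);               // e.g. the 84 MB input file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("configured block size: " + status.getBlockSize() + " B");

        // Only the final block of a file may be shorter than the configured size.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset() + "  length=" + loc.getLength() + " B");
        }
    }
}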
Upvotes: 0