Reputation: 309
I understand the disadvantages of small files and small block sizes in HDFS. I'm trying to understand the rationale behind the default 64/128 MB block size. Are there any drawbacks to having a large block size (say, 2 GB? I've read that larger values than that cause issues, the details of which I haven't yet dug into).
Issues I see with too large a block size (please correct me, maybe some or all of these issues don't really exist):
Possibly, there could be issues with re-replicating a 1 GB file when a data node goes down, since the cluster has to transfer the whole file as a single block. This seems to be a problem only when considering a single file; with a smaller block size, say 128 MB, we would have to transfer many smaller blocks instead (which, I think, involves more overhead).
It could trouble mappers. Each mapper might end up with a whole large block, reducing the possible number of mappers. But this shouldn't be an issue if we use a smaller split size (see the sketch after this list)?
This one sounded stupid when it occurred to me, but I'll throw it in anyway: since the namenode does not know the size of the file beforehand, it could consider a data node unavailable because it doesn't have enough disk space for a new block (given a large block size of maybe 1-2 GB). But maybe it handles this smartly by just cutting down the block size of that particular block (which is probably a bad solution anyway).
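To clarify what I mean by a smaller split size in point 2: as far as I understand, the split size can be capped below the block size, so the block size alone shouldn't limit the number of mappers. A rough, untested sketch against the newer `org.apache.hadoop.mapreduce` API (the 128 MB figure is just a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Cap each input split at 128 MB, even if the HDFS block size is much
        // larger (e.g. 1 GB). The number of map tasks is driven by splits,
        // not by blocks.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```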
The right block size probably depends on the use case. I basically want an answer to this question: is there a situation/use case where a large block size can hurt?
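In case it matters for the answer: I'm aware the block size doesn't have to be cluster-wide, it can also be set per file at write time. A minimal sketch of what I mean, assuming the `FileSystem` API (path, replication and sizes are just placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default block size for new files (256 MB here).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Or pass an explicit block size for this one file:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/tmp/example.dat"), true, 4096, (short) 3,
                1024L * 1024 * 1024); // 1 GB blocks for this file only
        out.close();
    }
}
```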
Any help is appreciated. Thanks in advance.
Upvotes: 7
Views: 4541
Reputation: 63269
I did extensive performance validation of high-end Hadoop clusters, varying the block size from 64 MB up to 2 GB. To answer the question: imagine workloads that frequently need to process smallish files, say tens of MB. Which block size do you think will be more performant in that case, 64 MB or 1024 MB?
For large files, yes, larger block sizes tend toward better performance, since the per-mapper overhead is not negligible.
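To put rough, illustrative numbers on it (not from my benchmarks): with splits equal to the block size, a 1 TB input yields about 16,384 map tasks at 64 MB blocks versus about 1,024 at 1 GB blocks. Each task pays JVM startup and scheduling costs, so fewer, larger blocks amortize that overhead much better.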
Upvotes: 2