Reputation: 422
Currently I'm using SequenceFile to compress our existing HDFS data. Now I have two options for storing these SequenceFiles:

1. One single large SequenceFile.
2. Many smaller SequenceFiles, each roughly the size of one HDFS block (128MB).
As we know, HDFS stores files as blocks, and each block goes to one mapper. So I think there's no difference between the two options when MapReduce processes the SequenceFile(s).
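(For what it's worth, one way to check this claim is to ask the input format how many splits it would produce for each layout. A minimal sketch against the Hadoop 2 MapReduce API, with a hypothetical input directory `/data/seqfiles`; since SequenceFiles are splittable, both layouts should yield about one split per HDFS block:)

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // Hypothetical directory holding the SequenceFile(s).
        FileInputFormat.addInputPath(job, new Path("/data/seqfiles"));

        // SequenceFiles are splittable, so one big file and many
        // block-sized files should both produce roughly one split
        // (and therefore one mapper) per HDFS block.
        List<InputSplit> splits =
                new SequenceFileInputFormat<Text, Text>().getSplits(job);
        System.out.println("Splits (mappers): " + splits.size());
    }
}
```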
The only disadvantage I know of for option two is that the namenode incurs more overhead to maintain all those files, whereas with option one there is only a single file.
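(A rough way to measure that overhead is to count the files, directories, and blocks under each layout, since each of those is an object held in namenode memory. A minimal sketch using the FileSystem API, again with a hypothetical path:)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeObjects {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/seqfiles"); // hypothetical directory

        // Every file, directory, and block is tracked in namenode memory,
        // so many small files cost more than one big file of the same size.
        ContentSummary cs = fs.getContentSummary(dir);
        System.out.println("Files:       " + cs.getFileCount());
        System.out.println("Directories: " + cs.getDirectoryCount());
        System.out.println("Bytes:       " + cs.getLength());

        // Rough block count at the default block size.
        long blockSize = fs.getDefaultBlockSize(dir);
        System.out.println("Approx. blocks: "
                + (cs.getLength() + blockSize - 1) / blockSize);
    }
}
```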
I am confused about these two options because the many articles I've seen recommend one approach or the other.
Can anyone point me in the right direction? Which option is better? What are the advantages and disadvantages of each? Thanks!
Upvotes: 0
Views: 1227
Reputation: 5636
There is a question on Quora.com about why 64MB was chosen as the default chunk size (that was for an older version; 128MB is the default block size now). Although that question is somewhat different, Ted Dunning's answer there addresses your question too. Ted Dunning wrote:
The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.
So I think points 2 and 3 of his answer cover your case, and now you have to decide, based on your requirements, whether to store the data as one single big file or in smaller chunks of 128MB (and yes, you can increase the block size too if you want).
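(If you lean toward larger chunks, note that the HDFS block size can be set per file at write time. A minimal sketch of writing a block-compressed SequenceFile with a larger block size, assuming the Hadoop 2 SequenceFile.Writer options API; the 256MB value and the output path are hypothetical:)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteBigBlockSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/seqfiles/big.seq"); // hypothetical path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // Block-compress the records inside the SequenceFile.
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK),
                // Request a 256MB HDFS block size for this one file
                // (hypothetical value; the cluster default is untouched).
                SequenceFile.Writer.blockSize(256L * 1024 * 1024));
        try {
            writer.append(new LongWritable(1L), new Text("example record"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```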
Upvotes: 4