Michael

Reputation: 422

A single large file or multiple small files in HDFS (Sequence File)?

Currently I'm using Sequence File to compress our existing HDFS data.
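For context, here is a minimal sketch of how such a SequenceFile might be written with block compression using the standard Hadoop API; the output path and record contents are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SequenceFileCompressExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/data/compressed/part-00000.seq"); // hypothetical output path

            SequenceFile.Writer writer = null;
            try {
                // BLOCK compression groups many records before compressing,
                // which usually gives the best ratio for SequenceFiles.
                writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(out),
                        SequenceFile.Writer.keyClass(LongWritable.class),
                        SequenceFile.Writer.valueClass(Text.class),
                        SequenceFile.Writer.compression(
                                SequenceFile.CompressionType.BLOCK, new DefaultCodec()));

                // Toy records; in practice the values come from the existing HDFS data.
                for (long i = 0; i < 1000; i++) {
                    writer.append(new LongWritable(i), new Text("record-" + i));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }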

Now I have two options for storing this Sequence File: as one single large file, or as multiple smaller files.

As we know, HDFS files are stored as blocks, and each block goes to one mapper. So I think there's no difference when MapReduce processes the Sequence File(s).
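One way to sanity-check that claim is to count the blocks behind each option, since with the default input formats each block normally becomes one input split. A sketch using the HDFS FileSystem API (the path argument is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockCountCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Point this at the single big file or at the directory of
            // smaller files and compare the totals.
            Path path = new Path(args[0]);

            long totalBlocks = 0;
            for (FileStatus status : fs.listStatus(path)) {
                if (status.isFile()) {
                    BlockLocation[] blocks =
                            fs.getFileBlockLocations(status, 0, status.getLen());
                    totalBlocks += blocks.length;
                }
            }
            // One block ~= one input split ~= one map task, so the mapper count
            // is roughly the same whether the bytes live in one file or many.
            System.out.println("blocks (~map tasks): " + totalBlocks);
        }
    }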

The only disadvantage I know of for option two is that the namenode needs more overhead to maintain those files, whereas option one has just a single file.
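To put rough numbers on that overhead, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of roughly 150 bytes of namenode heap per file/block object, 100 GB of total data, and 128 MB blocks; all of these figures are assumptions for illustration, not from the question:

    public class NameNodeOverheadEstimate {
        // Rough rule of thumb: each file entry and each block entry costs on the
        // order of 150 bytes of namenode heap. The exact number varies by version.
        private static final long BYTES_PER_OBJECT = 150;

        static long estimate(long totalBytes, long fileCount, long blockSize) {
            long blocksPerFile = (totalBytes / fileCount + blockSize - 1) / blockSize;
            long objects = fileCount + fileCount * blocksPerFile; // file entries + block entries
            return objects * BYTES_PER_OBJECT;
        }

        public static void main(String[] args) {
            long total = 100L * 1024 * 1024 * 1024;   // assume 100 GB of data
            long blockSize = 128L * 1024 * 1024;      // 128 MB blocks

            // Option one: a single large file. Option two: one file per block.
            System.out.println("option one: ~" + estimate(total, 1, blockSize) + " bytes of heap");
            System.out.println("option two: ~" + estimate(total, total / blockSize, blockSize) + " bytes of heap");
        }
    }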

I am confused about these two options because of the conflicting recommendations I have seen in many articles.

Can anyone point me to the correct way to do this? Which is better? What are the advantages/disadvantages of these two options? Thanks!

Upvotes: 0

Views: 1227

Answers (1)

Rajen Raiyarela

Reputation: 5636

Quora.com has a question about why 64MB was chosen as the default chunk size (this applies to older versions; 128MB is the default block size now). Although that question is somewhat different, Ted Dunning's answer there addresses your question too. Ted Dunning wrote:

The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.

  1. Having a much smaller block size would cause seek overhead to increase.
  2. Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
  3. Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.

So I think points 2 and 3 answer your question, and now you have to decide, based on your requirements, whether to store the data as one single big file or in smaller chunks of 128MB (and yes, you can increase the block size too if you want).
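If you do go with a larger block size for the single big file, one possible sketch of setting it per file from the client side (the 256 MB value and the path are just examples, not a recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSizeWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Block size is a per-file, client-side setting, not a cluster-wide one,
            // so a single big file can simply be written with larger blocks.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB instead of 128 MB

            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/data/big-input.seq"); // hypothetical path
            try (FSDataOutputStream stream = fs.create(out)) {
                stream.writeBytes("payload goes here\n"); // placeholder content
            }
        }
    }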

Upvotes: 4
