Reputation: 5397
I have noticed that when I have many small gz files (a few KB each), a Hadoop job takes much longer to complete than when I combine those small gz files into one big gz file. Also, in the small-file case the number of mappers equals the number of files; why is that so? In the latter case there is just one mapper. Might that be the reason for the difference?
Upvotes: 0
Views: 53
Reputation: 6855
Hadoop in general works well with a small number of large files, not the other way around.
By default, MapReduce assigns a map task to each input split, and since gzip is not a splittable compression format, each gz file becomes its own split. Hence, if you have a lot of small gz files, each file gets its own mapper to process it. On top of the actual processing, each mapper also pays a few seconds of JVM/task initialization overhead, which is why the total time grows as the number of files increases.
It's recommended to keep files close to the HDFS block size to avoid the small-files problem.
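If you can't merge the inputs upstream, one common workaround is CombineTextInputFormat, which packs many whole (non-splittable) gz files into each split so far fewer mappers are launched. A minimal driver sketch, assuming a standard MapReduce job; the class name, 128 MB split cap, and paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallGzDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-gz-combine");
        job.setJarByClass(SmallGzDriver.class);

        // Pack many whole gz files into each split instead of launching
        // one mapper per file; cap each split near a 128 MB block size.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // Mapper/Reducer setup omitted; input/output paths are placeholders.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Alternatively, since concatenated gzip streams are themselves a valid gzip stream, you can simply merge the small files into a few block-sized archives before the job runs, as you already observed.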
Take a look at this blog post from Cloudera and this SF question as well.
Upvotes: 1