Naresh

Reputation: 5397

Hadoop job completion time increases as the number of input gz files increases

I have noticed that when I have many small gz files (a few KB each), a Hadoop job takes much longer to complete than when I combine those small gz files into one big gz file. Also, in the small-file case the number of mappers equals the number of files; why is that? In the combined case there is just one mapper. Could that be the reason for the slowdown?

Upvotes: 0

Views: 53

Answers (1)

Ashrith

Reputation: 6855

Hadoop in general works well with a small number of large files, not the other way around.

By default MapReduce assigns one map task per input split, and because gzip is not a splittable format, each .gz file becomes a single split. Hence, if you have a lot of small gz files, each file gets its own mapper. On top of the actual processing, JVM initialization for each mapper takes a few seconds, so you see the total time increase as the number of files increases.
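If you cannot merge the files up front, one common mitigation is to pack many whole files into each split so that several small gz files share a mapper. A minimal sketch of the driver-side configuration, assuming the standard `CombineTextInputFormat` from `org.apache.hadoop.mapreduce.lib.input` (the split-size value here is illustrative):

```java
// Sketch: pack many small (whole) gz files into each input split,
// so one mapper JVM processes several files instead of one.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFiles {
    static void configure(Job job) throws Exception {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at ~128 MB (illustrative value;
        // tune toward your HDFS block size).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```

Each gz file is still decompressed as a whole (gzip is not splittable), but the per-mapper JVM startup cost is amortized over many files.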

It's recommended to have input files close to the HDFS block size, which avoids the small-files problem.
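Merging is cheap for gzip in particular, because the format allows multiple members in one stream: you can concatenate .gz files byte-for-byte without recompressing, and `gunzip` will decompress all members in order. A small demonstration (file names are hypothetical):

```shell
# gzip members concatenate into a valid multi-member stream,
# so many small .gz files can be merged without recompressing.
printf 'line from file 1\n' | gzip > part-0001.gz
printf 'line from file 2\n' | gzip > part-0002.gz
cat part-0001.gz part-0002.gz > combined.gz
gunzip -c combined.gz   # prints both lines in order
```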

Take a look at this blog post from Cloudera and this SF question as well.

Upvotes: 1

Related Questions