Reputation: 1867
I have Hadoop mapper code that takes files as input, processes them, and emits a single key and the processed value to the reducer. The issue is that I have close to 100,000 text files, none bigger than 5-6 KB, but when I run the job it takes ages to complete. One reason is that each mapper is started to process only one file and is then destroyed, so I lose a lot of time in mapper start-up even though the processing itself is not at all computationally expensive. How do I ensure that each mapper processes multiple files? My maximum number of mappers is at the default setting, which I can see is 6.
Do let me know if any further details are required.
Upvotes: 3
Views: 196
Reputation: 16392
You should use CombineFileInputFormat to process many small files: it packs multiple files into a single input split, so each mapper handles many files instead of just one and you stop paying the per-file task start-up cost. This really helps performance.
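As a minimal driver sketch, assuming the Hadoop 2.x `mapreduce` API where `CombineTextInputFormat` is a ready-made, text-oriented subclass of `CombineFileInputFormat` (on older 1.x releases you would subclass `CombineFileInputFormat` yourself). `MyMapper`/`MyReducer` are placeholders for your existing classes, and the 64 MB split cap is just an example value to tune:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "combine small files");
            job.setJarByClass(SmallFilesJob.class);

            // Pack many small files into each split so one mapper processes many files.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 64 MB (example value; tune for your cluster).
            CombineTextInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024);

            // Placeholders: plug in your existing mapper and reducer here.
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With 100,000 files of ~5 KB each (roughly 500 MB total), a 64 MB cap should yield on the order of 8 splits, so you go from 100,000 short-lived map tasks to a handful of long-running ones.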
Upvotes: 3