Ajay Nair

Reputation: 1867

Make mappers process multiple files instead of a single file

I have Hadoop mapper code that takes files as input, processes them, and emits a single key and the processed value to the reducer. The issue is that I have close to 100,000 text files, each no more than 5-6 KB, but when I run the job it takes ages to complete. One reason is that each mapper is started to process only one file and is then destroyed, so I lose a lot of time in mapper startup even though the processing itself is not at all computationally expensive. How do I ensure that mappers continue on to process multiple files? My maximum number of mappers is at the default setting, which I can see is 6.

Do let me know if any further details are required.

Upvotes: 3

Views: 196

Answers (1)

Chris Gerken

Reputation: 16392

You should use CombineFileInputFormat to process many small files. It packs multiple small files into each input split, so a single mapper task processes many files instead of one file per mapper, which eliminates the per-file task startup overhead you're seeing and really helps performance.
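As a rough sketch, on Hadoop 2.x+ you can use CombineTextInputFormat, the concrete text-file subclass of CombineFileInputFormat that ships with Hadoop; on Hadoop 1.x you would instead subclass CombineFileInputFormat and supply your own RecordReader. The class names MyMapper/MyReducer, the paths, and the 128 MB split cap below are placeholders for your own job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "small files");
            job.setJarByClass(SmallFilesJob.class);

            // Pack many small files into each split so one mapper
            // processes multiple files instead of one file per mapper.
            job.setInputFormatClass(CombineTextInputFormat.class);

            // Cap the combined split size (128 MB here, a placeholder)
            // so splits stay reasonably balanced across mappers.
            CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);

            job.setMapperClass(MyMapper.class);   // your existing mapper
            job.setReducerClass(MyReducer.class); // your existing reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With 100,000 files of ~5 KB each, capping splits this way should reduce the job from 100,000 map tasks to a small handful, which is where your speedup comes from.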

Upvotes: 3
