Reputation: 1867
I have Hadoop mapper code that takes files as input, processes them, and emits a single key and the processed value to the reducer. The issue is that I have close to 100,000 text files, none bigger than 5-6 KB, but when I run the job it takes ages to complete. One reason is that each mapper is started to process only one file and is then destroyed, so I lose a lot of time in mapper start-up even though the processing itself is not at all computationally expensive. How do I ensure that each mapper processes multiple files? My maximum number of mappers is at the default setting, which I can see is 6.
Do let me know if any further details are required.
Upvotes: 3
Views: 196
Reputation: 16392
You should use CombineFileInputFormat to process many small files: it packs multiple files into a single input split, so each mapper handles many files instead of just one and you stop paying the per-file task start-up cost. This really helps performance.
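As a minimal driver sketch, assuming the Hadoop 2.x `mapreduce` API where `CombineTextInputFormat` is a ready-made, text-oriented subclass of `CombineFileInputFormat` (on older 1.x releases you would subclass `CombineFileInputFormat` yourself). `MyMapper`/`MyReducer` are placeholders for your existing classes, and the 64 MB split cap is just an example value to tune:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "combine small files");
            job.setJarByClass(SmallFilesJob.class);

            // Pack many small files into each split so one mapper processes many files.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 64 MB (example value; tune for your cluster).
            CombineTextInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024);

            // Placeholders: plug in your existing mapper and reducer here.
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With 100,000 files of ~5 KB each (roughly 500 MB total), a 64 MB cap should yield on the order of 8 splits, so you go from 100,000 short-lived map tasks to a handful of long-running ones.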
Upvotes: 3