Reputation: 1605
I have ~5000 entries in my Hadoop input file, but I know in advance that some of the lines will take much longer to process than others (in the map stage). This is mainly because I need to download a file from Amazon S3 for each entry, and the size of the file varies between tasks.
I want the biggest map tasks to be processed first, so that all my Hadoop nodes finish working at roughly the same time.
Is there a way to do that with Hadoop? Or do I need to rework the whole thing? (I am new to Hadoop)
Thanks!
Upvotes: 4
Views: 1038
Reputation: 1652
Well, if you implement your own InputFormat (the getSplits() method contains the logic for split creation), then in theory you could achieve what you want.
BUT you have to take special care, because the order in which the InputFormat returns the splits is not the order in which Hadoop will process them. There is split re-ordering code inside the JobClient:
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new NewSplitComparator());
which makes the whole thing a bit more tricky. But you could implement a custom InputFormat plus a custom InputSplit and make InputSplit#getLength() depend on the expected execution time, as in the sketch below.
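Roughly, it could look like this (a minimal sketch against the newer org.apache.hadoop.mapreduce API; the class names, the "s3.keys"/"s3.sizes" configuration properties, and the size estimates are all made up for illustration, and each public class would go in its own .java file). The split reports the expected S3 download size as its length, so the size-based sort above schedules the heaviest tasks first:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical split: one S3 object per map task, reporting its expected
// download size as the split "length" that the framework sorts on.
public class WeightedS3Split extends InputSplit implements Writable {

    private Text s3Key = new Text();   // S3 object this task will download
    private long expectedBytes;        // estimated cost of processing it

    public WeightedS3Split() { }       // needed for deserialization

    public WeightedS3Split(String s3Key, long expectedBytes) {
        this.s3Key = new Text(s3Key);
        this.expectedBytes = expectedBytes;
    }

    public String getS3Key() {
        return s3Key.toString();
    }

    @Override
    public long getLength() {
        // This is the value the "biggest go first" sort uses.
        return expectedBytes;
    }

    @Override
    public String[] getLocations() {
        // S3 downloads have no data locality, so no preferred hosts.
        return new String[0];
    }

    @Override
    public void write(DataOutput out) throws IOException {
        s3Key.write(out);
        out.writeLong(expectedBytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        s3Key.readFields(in);
        expectedBytes = in.readLong();
    }
}

// Hypothetical InputFormat: getSplits() emits one WeightedS3Split per key,
// reading keys and size estimates from the made-up configuration properties.
public class WeightedS3InputFormat extends InputFormat<Text, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        String[] keys  = context.getConfiguration().getStrings("s3.keys");
        String[] sizes = context.getConfiguration().getStrings("s3.sizes");
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (int i = 0; i < keys.length; i++) {
            splits.add(new WeightedS3Split(keys[i], Long.parseLong(sizes[i])));
        }
        return splits;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // The record reader that actually fetches the S3 object is omitted here.
        throw new UnsupportedOperationException("record reader not shown");
    }
}

Note that getLength() does not have to be an exact byte count; any number roughly proportional to the expected execution time should work for the ordering, since that is what the comparator sorts on.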
Upvotes: 2