Reputation: 1605
I have ~5000 entries in my Hadoop input file, but I know in advance that some of the lines will take much longer to process than others (in the map stage). This is mainly because I need to download a file from Amazon S3 for each entry, and the size of the file varies between tasks.
I want the biggest map tasks to be processed first, so that all my Hadoop nodes finish working at roughly the same time.
Is there a way to do that with Hadoop? Or do I need to rework the whole thing? (I am new to Hadoop)
Thanks!
Upvotes: 4
Views: 1038
Reputation: 1652
Well, if you implement your own InputFormat (the getSplits() method contains the logic for split creation), then in theory you could achieve what you want.
BUT you have to take special care, because the order in which the InputFormat returns the splits is not the order in which Hadoop will process them. There is split re-ordering code inside the JobClient:
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new NewSplitComparator());
which makes the whole thing a bit more tricky. But you could implement a custom InputFormat plus a custom InputSplit and make InputSplit#getLength() depend on the expected execution time, as in the sketch below.
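Roughly, it could look like this (a minimal sketch against the newer org.apache.hadoop.mapreduce API; the class names, the "s3.keys"/"s3.sizes" configuration properties, and the size estimates are all made up for illustration, and each public class would go in its own .java file). The split reports the expected S3 download size as its length, so the size-based sort above schedules the heaviest tasks first:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical split: one S3 object per map task, reporting its expected
// download size as the split "length" that the framework sorts on.
public class WeightedS3Split extends InputSplit implements Writable {

    private Text s3Key = new Text();   // S3 object this task will download
    private long expectedBytes;        // estimated cost of processing it

    public WeightedS3Split() { }       // needed for deserialization

    public WeightedS3Split(String s3Key, long expectedBytes) {
        this.s3Key = new Text(s3Key);
        this.expectedBytes = expectedBytes;
    }

    public String getS3Key() {
        return s3Key.toString();
    }

    @Override
    public long getLength() {
        // This is the value the "biggest go first" sort uses.
        return expectedBytes;
    }

    @Override
    public String[] getLocations() {
        // S3 downloads have no data locality, so no preferred hosts.
        return new String[0];
    }

    @Override
    public void write(DataOutput out) throws IOException {
        s3Key.write(out);
        out.writeLong(expectedBytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        s3Key.readFields(in);
        expectedBytes = in.readLong();
    }
}

// Hypothetical InputFormat: getSplits() emits one WeightedS3Split per key,
// reading keys and size estimates from the made-up configuration properties.
public class WeightedS3InputFormat extends InputFormat<Text, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        String[] keys  = context.getConfiguration().getStrings("s3.keys");
        String[] sizes = context.getConfiguration().getStrings("s3.sizes");
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (int i = 0; i < keys.length; i++) {
            splits.add(new WeightedS3Split(keys[i], Long.parseLong(sizes[i])));
        }
        return splits;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // The record reader that actually fetches the S3 object is omitted here.
        throw new UnsupportedOperationException("record reader not shown");
    }
}

Note that getLength() does not have to be an exact byte count; any number roughly proportional to the expected execution time should work for the ordering, since that is what the comparator sorts on.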
Upvotes: 2