Reputation: 117
I am trying to implement an algorithm that requires only a single reducer and runs the MapReduce job iteratively. The output of each mapper in a given iteration is accumulated in the reducer and processed there; the reducer's output is then passed as input to the mappers in the next iteration. I want to execute the job asynchronously, i.e. as soon as a pre-defined number of mappers have finished, pass their output directly to the reducer, skipping the shuffle and sort phase, since it only creates overhead for my algorithm. Is that even possible? If not, what can be done at the implementation level to execute a MapReduce job asynchronously? I went through a number of research papers but could not get any idea from them.
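To make the intended control flow concrete, here is a minimal structural sketch in plain Java (no Hadoop involved; the map, reduce, and convergence logic are placeholders I made up for illustration): several map calls per round, a single reduce over all mapper outputs, and the reducer's result fed back as the next round's input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Structural sketch of the iterative single-reducer pattern.
 * All of the actual logic below is placeholder arithmetic.
 */
public class IterativeJobSketch {

    // "Mapper": each mapper halves its input value (placeholder logic).
    static long map(long value) {
        return value / 2;
    }

    // Single "reducer": sums all mapper outputs (placeholder logic).
    static long reduce(List<Long> mapperOutputs) {
        return mapperOutputs.stream().mapToLong(Long::longValue).sum();
    }

    // Driver: run rounds until the reduced value stops changing
    // or maxRounds is reached.
    static long run(List<Long> input, int maxRounds) {
        long result = reduce(input.stream()
                .map(IterativeJobSketch::map)
                .collect(Collectors.toList()));
        for (int round = 1; round < maxRounds; round++) {
            // The reducer output becomes the input of the next round.
            List<Long> next = List.of(result);
            long reduced = reduce(next.stream()
                    .map(IterativeJobSketch::map)
                    .collect(Collectors.toList()));
            if (reduced == result) {
                break; // converged
            }
            result = reduced;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(16L, 8L, 4L), 10)); // prints 0
    }
}
```

In real Hadoop this loop lives in the driver: each round submits a `Job` whose input path is the previous round's reducer output path.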
Thanks.
Upvotes: 2
Views: 184
Reputation: 3683
You have to code up your own custom solution for this. I did a similar thing in a project recently.
It requires a fair bit of code, so I can only outline the steps here :)
Things you will need to do:

1. Set mapreduce.job.reduce.slowstart.completedmaps to 0.0 so that the reducer comes up before the mappers finish (this will give you a speedup right away, btw. Try it out before going ahead with the steps below ;) maybe it's enough).
2. Write a custom org.apache.hadoop.mapred.MapOutputCollector that writes the shuffle output to a Socket instead of to the standard shuffle path (this is the mapper side).
3. Write a custom org.apache.hadoop.mapred.ShuffleConsumerPlugin that waits for connections from the mappers and reads the pairs from the network (this is the reducer side).
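Wiring these pieces together is done through job configuration. A hypothetical mapred-site.xml fragment might look like the following; the property names come from the pluggable-shuffle documentation linked below, but the com.example class names are placeholders for the custom classes you would write:

```xml
<!-- Sketch only: com.example.* classes are hypothetical placeholders. -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.0</value>
</property>
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>com.example.SocketMapOutputCollector</value>
</property>
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>com.example.SocketShuffleConsumerPlugin</value>
</property>
```

The same properties can also be set per job on the `Configuration` object in the driver instead of in mapred-site.xml.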
Further reading: https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
Def. doable, but requires some effort :)
Upvotes: 3