Reputation: 364
I have installed Hadoop MapReduce on a single node, and I have a top-ten problem.
Say I have 10k (key, value) pairs and want to find the 10 pairs with the best values.
I first wrote a simple project that iterates over the whole data set, and it needed only a couple of minutes to produce the answer.
Then I wrote a MapReduce application using the top-ten design pattern to solve the same problem, and it needed more than 4 hours to produce the answer. (I used the same machine and the same sorting algorithm.)
I think this probably happens because MapReduce needs more services to run, more network activity, and more effort to read from and write to HDFS. Are there any other factors that explain why MapReduce (in this setup) is slower than not using MapReduce?
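For reference, the single-process approach described above (one pass over all pairs, keeping the 10 best) can be sketched in a few lines. This is a minimal Python sketch, not the original project: the function name `top_ten`, the variable `data`, and the randomly generated values are assumptions made only for illustration.

```python
import heapq
import random


def top_ten(pairs, k=10):
    """Return the k (key, value) pairs with the largest values, best first."""
    # heapq.nlargest scans the input once and keeps only k candidates,
    # so the full data set never has to be sorted.
    return heapq.nlargest(k, pairs, key=lambda kv: kv[1])


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical input: 10,000 (key, value) pairs with random scores.
    data = [(f"key{i}", random.random()) for i in range(10_000)]
    for key, value in top_ten(data):
        print(key, value)
```

A scan like this has no job startup, scheduling, or HDFS round trips, which is the baseline the MapReduce run is being compared against.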
Upvotes: 2
Views: 660
Reputation: 403
MapReduce is slower on a single-node setup because only one mapper and one reducer can work on it at any given time. The mapper has to iterate through each of the splits in turn, and the reducer works on two mapper outputs at a time, then on two such reducer outputs, and so on.
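The flow described above can be simulated in plain Python to make the steps concrete. This is only a sketch under stated assumptions: it does not use the Hadoop API, the names `mapper`, `reducer`, and `top_ten_mapreduce` and the splitting rule are hypothetical, and the pairwise merging follows the description in this answer rather than any particular Hadoop job configuration.

```python
import heapq

K = 10  # how many results to keep


def mapper(split, k=K):
    """Scan one input split and emit only its local top k pairs."""
    return heapq.nlargest(k, split, key=lambda kv: kv[1])


def reducer(left, right, k=K):
    """Merge two partial results and keep the top k of their union."""
    return heapq.nlargest(k, left + right, key=lambda kv: kv[1])


def top_ten_mapreduce(pairs, num_splits=4, k=K):
    """Simulate the map and reduce phases sequentially, as on one node."""
    # Cut the input into roughly equal splits (hypothetical splitting rule).
    size = max(1, -(-len(pairs) // num_splits))  # ceiling division
    splits = [pairs[i:i + size] for i in range(0, len(pairs), size)]

    # Map phase: on a single node the splits are processed one after another.
    partials = [mapper(split, k) for split in splits]

    # Reduce phase: merge partial results two at a time until one remains.
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                merged.append(reducer(partials[i], partials[i + 1], k))
            else:
                merged.append(partials[i])  # odd one out moves up unchanged
        partials = merged

    return partials[0] if partials else []
```

Even in this toy version the data is scanned once by the mappers and then partial results are merged again in several rounds, so with a single worker it does strictly more work than one direct scan.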
So, in terms of complexity:
for the normal project: t(n) = n => O(n)
for MapReduce: t(n) = (n/x)*t(n/2x) => O((n/x) log(n/x)), where x is the number of nodes
Which do you think is bigger, for a single node and for multiple nodes?
Explanation of the MapReduce complexity:
time for one iteration over the complete data: n
number of simultaneous map functions: x, since only one can run on each node
time required to map the complete data: n/x, since n is the time one mapper takes for the complete data
for the reduce job, half the time of the previous map is required, since it works on two mapper outputs simultaneously; therefore time = n/2x for x reducers on x nodes
hence the recurrence, in which every step takes half the time of the previous one:
t(n) = (n/x)*t(n/2x)
Solving this recurrence gives O((n/x) log(n/x)).
This is not meant to be exact, only an approximation.
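To put rough numbers on the comparison, the two estimates can be evaluated for a few node counts. This is a small sketch that takes the O(n) and O((n/x) log(n/x)) estimates from this answer at face value; the base-2 logarithm, the function names `plain_cost` and `mapreduce_cost`, and the chosen node counts are assumptions, and constant factors such as job startup and HDFS I/O are ignored.

```python
import math


def plain_cost(n):
    """Estimated cost of the plain single-pass program: n."""
    return n


def mapreduce_cost(n, x):
    """Estimated cost of MapReduce on x nodes: (n/x) * log2(n/x)."""
    per_node = n / x
    # For tiny per-node inputs the logarithm is <= 0, so fall back to per_node.
    return per_node * math.log2(per_node) if per_node > 1 else per_node


if __name__ == "__main__":
    n = 10_000
    print(f"plain program: {plain_cost(n)}")
    for x in (1, 2, 10, 100, 1000):
        print(f"MapReduce on {x:>4} node(s): {mapreduce_cost(n, x):.0f}")
```

Under this model a single node (x = 1) costs n log n, which is larger than n, while the estimate only drops below n once x is large enough that (1/x) log(n/x) is less than 1.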
Upvotes: 3