Claudiu

Reputation: 229291

Parallelizing the "Reduce" in "MapReduce"

I understand how Map is easily parallelizable - each computer/CPU can just operate on a small portion of the array.

Is Reduce/foldl parallelizable? It seems like each computation depends on the previous one. Is it just parallelizable for certain types of functions?

Upvotes: 11

Views: 4492

Answers (6)

cdiggins

Reputation: 18203

Technically, a reduce is not the same as a foldl (fold-left), which can also be described as an accumulate.

The example given by Jules illustrates a reduce operation very well:

step 1: 1 + 2 + 3 + 4 
step 2:   3   +   7   
step 3:       10      

Note that at each step the result is an array, including the final result, which is an array of one item.

A fold-left is like the following:

step 0: a = 0
step 1: a = a + 1 
step 2: a = a + 2 
step 3: a = a + 3
step 4: a = a + 4
step 5: a

Obviously both produce the same result here, but a foldl has a well-defined result when given a non-associative operator (like subtraction), whereas a reduce doesn't.
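A quick way to see the difference (a small sketch using Python's functools.reduce, which is a left fold; the operator and numbers are only for illustration):

from functools import reduce

xs = [1, 2, 3, 4]

# foldl: strictly left-to-right, well-defined even for a non-associative op
left_fold = reduce(lambda a, x: a - x, xs, 0)    # ((((0-1)-2)-3)-4) = -10

# a tree-shaped reduce regroups the operands, so subtraction can give a
# different answer depending on how the implementation splits the work
tree_shaped = (xs[0] - xs[1]) - (xs[2] - xs[3])  # (1-2) - (3-4) = 0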

Upvotes: 1

bradheintz

Reputation: 3152

It depends on your Reduce step. In a Hadoop-style implementation of MapReduce, your Reducer is getting called once per key, with all the rows relevant to that key.

So, for example, your Mapper might be taking in a lot of unordered web server logs, adding some metadata (e.g., geocoding), and emitting [key, record] pairs with a cookie ID as the key. Your Reducer would then be called once per cookie ID and would be fed all the data for that cookie, and could compute aggregate info such as visit frequency or average pages viewed per visit. Or you could key on geocode data, and gather aggregate stats based on geography.

Even if you're not doing per-key aggregate analysis - indeed, even if you're computing something over the whole set - it might be possible to break your computation into chunks, each of which could be fed to a Reducer.
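A minimal sketch of that shape in plain Python (not the real Hadoop API; the record fields and the dictionary standing in for the shuffle are made up for illustration):

from collections import defaultdict

# toy rows standing in for parsed, geocoded web server log records
logs = [
    {"cookie": "a", "pages": 3},
    {"cookie": "b", "pages": 1},
    {"cookie": "a", "pages": 5},
]

def mapper(record):
    # emit [key, record] pairs keyed on the cookie ID
    yield record["cookie"], record

def reducer(cookie, records):
    # called once per key, with every row for that key
    visits = len(records)
    pages = sum(r["pages"] for r in records)
    return cookie, {"visits": visits, "avg_pages_per_visit": pages / visits}

# the shuffle groups mapper output by key before the reducers run
grouped = defaultdict(list)
for row in logs:
    for key, value in mapper(row):
        grouped[key].append(value)

results = [reducer(key, rows) for key, rows in grouped.items()]
# cookie "a": 2 visits, 4.0 pages per visit; cookie "b": 1 visit, 1.0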

Upvotes: 0

Piotr Lesnicki

Reputation: 9730

If the operation underlying your reduction is associative*, you can play with the order of operations and with locality. That is why the 'gather' phase often has a tree-like structure: it can be done in several passes, in logarithmic time:

a  +  b  +  c  +  d
 \   /       \   /
 (a+b)       (c+d)
     \       /
   ((a+b)+(c+d))

instead of (((a+b)+c)+d)
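A minimal sketch of that tree-shaped gather (plain Python; it assumes an associative op and a non-empty input). Each pass combines independent pairs, so every pass could run in parallel across threads, vector lanes, or nodes:

import operator

def tree_reduce(xs, op):
    # reduce in logarithmically many passes; the pairs within a pass
    # do not depend on each other, which is where the parallelism lives
    while len(xs) > 1:
        pairs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:            # carry an odd leftover element forward
            pairs.append(xs[-1])
        xs = pairs
    return xs[0]

print(tree_reduce([1, 2, 3, 4], operator.add))   # ((1+2)+(3+4)) = 10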

If your operation is also commutative, further optimizations are possible, since you can gather in a different order (this can matter for data alignment, for example when the operations are vector operations).

[*] Associative for the real mathematical operation you want, not for its implementation on machine types like floats, of course.

Upvotes: 15

Jason Ganetsky

Reputation: 421

Check out the combine phase in Hadoop

http://wiki.apache.org/hadoop/HadoopMapReduce

Upvotes: 3

Jules

Reputation: 6346

Yes, if the operator is associative. For example, you can parallelise summing a list of numbers:

step 1: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
step 2:   3   +   7   +   11  +   15
step 3:       10      +       26
step 4:               36

This works because (a+b)+c = a+(b+c), i.e. the order in which the additions are performed doesn't matter.
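A toy demonstration of the same point (Python threads won't actually make a pure-Python sum faster, but the structure is what matters): sum the two halves independently, then combine the partial sums.

from concurrent.futures import ThreadPoolExecutor

nums = [1, 2, 3, 4, 5, 6, 7, 8]

# each half can be summed on its own; associativity guarantees that
# combining the partial sums matches a plain left-to-right sum
with ThreadPoolExecutor(max_workers=2) as pool:
    left, right = pool.map(sum, [nums[:4], nums[4:]])

assert left + right == sum(nums)   # 10 + 26 == 36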

Upvotes: 6

strager

Reputation: 90012

Not sure what platform/language you're thinking of, but you can parallelize reduce operators like this:

// Original
result = null;
foreach(item in map) {
    result += item;
}

// Parallel
resultArray = array();
mapParts = map.split(numThreads);        // split the input into one slice per thread
foreach(thread) {
    result = null;
    foreach(item in mapParts[thread]) {
        result += item;                  // serial reduce over this thread's slice
    }
    resultArray += result;    // Lock this!
}
waitForThreads();
reduce(resultArray);                     // reduce the partial results the same way

As you can see, a parallel implementation is easily recursive. You split the map up, operate on each part in its own thread, then perform another reduce once those threads are done to bring the pieces together.

(This is the programmatic reasoning behind Piotr Lesnicki's answer.)
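A rough, runnable version of the same idea in Python (addition, the chunk count, and the process pool stand in for whatever your real operation and threading model are):

from concurrent.futures import ProcessPoolExecutor
from functools import reduce
import operator

def reduce_chunk(chunk):
    # the per-thread inner loop from the pseudocode: fold one slice serially
    return reduce(operator.add, chunk)

def parallel_reduce(data, workers=4):
    # split the input, reduce each part independently, then reduce the
    # partial results; on one machine a single extra pass is enough
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(reduce_chunk, chunks))
    return reduce(operator.add, partials)

if __name__ == "__main__":
    print(parallel_reduce(list(range(1, 101))))   # 5050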

Upvotes: 1
