Reputation: 28453
What is the point of map? For example, instead of the below, why not just use: reducer = (accum, x) => accum + (x + 2)?
Or with mapper and reducer separate:
mapper = (x) => x + 2
reducer = (accum, y) => accum + y
So:
// x y
// 0 2
// 1 3
[0, 1].map(mapper).reduce(reducer, 0) // result == 5
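For reference, running the fused reducer from the question alongside the separate mapper/reducer pair confirms they produce the same result:

```javascript
// Fused version from the question: the map step folded into the reducer.
const fused = (accum, x) => accum + (x + 2);

// Separate mapper and reducer, as above.
const mapper = (x) => x + 2;
const reducer = (accum, y) => accum + y;

console.log([0, 1].reduce(fused, 0));               // 5
console.log([0, 1].map(mapper).reduce(reducer, 0)); // 5
```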
Are there examples in "big data technologies" like Hadoop where moving all the functionality into the reducer is undesirable, or incurs some penalty that is avoided by having a separate mapper?
I can think of examples where knowing the initial value is actually required in the reducer, making the use of a "purely map" mapper function impossible, or at least pointless, as you'd have to map to a value that contains the initial value, e.g. having mapper return a tuple containing the initial value so that reducer can access it:
mapper = (x) => [x, lookupValue1[x] * lookupValue2[x]]
reducer = (accum, y) => { accum[y[0]] = y[1]; return accum; }
// x y
// 'alex' ['alex', -41]
// 'chris' ['chris', 102]
['alex', 'chris'].map(mapper).reduce(reducer, {})
// result = { 'alex': -41, 'chris': 102 }
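To make that snippet runnable, here is the same example with hypothetical lookupValue1 and lookupValue2 tables (invented here purely to reproduce the values above):

```javascript
// Hypothetical lookup tables, chosen so the products match the example.
const lookupValue1 = { alex: 41, chris: 51 };
const lookupValue2 = { alex: -1, chris: 2 };

const mapper = (x) => [x, lookupValue1[x] * lookupValue2[x]];
const reducer = (accum, y) => { accum[y[0]] = y[1]; return accum; };

console.log(['alex', 'chris'].map(mapper).reduce(reducer, {}));
// { alex: -41, chris: 102 }
```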
Upvotes: 0
Views: 77
Reputation: 5538
Think of MapReduce as a design pattern for efficiently processing "suitable" data. By this, I mean two things:
1) MapReduce is not an efficient way to process every type of data. There are certain types of data and processing steps that can leverage HDFS and distributed processing; MapReduce is just one tool in that league, best suited to certain algorithms.
2) Not all algorithms are suitable for MapReduce. Because it is a design pattern, it best suits algorithms that are in line with its design. That is why the MapReduce core library allows you to skip the mapper (using an identity mapping) or the reducer (by setting the number of reducers to zero). You are allowed to skip one or more phases of MapReduce according to your need.
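In JavaScript terms, skipping the map phase corresponds to using the identity function as the mapper, which leaves the data unchanged and pushes all the work into the reducer:

```javascript
// Identity mapping: skipping the map phase is equivalent to mapping each
// record to itself, so map(identity).reduce(...) === reduce(...).
const identity = (x) => x;
const sum = (accum, y) => accum + y;

console.log([1, 2, 3].map(identity).reduce(sum, 0)); // 6
console.log([1, 2, 3].reduce(sum, 0));               // 6, same result
```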
Keeping these two points in mind, if you understand how map - combine - sort + shuffle - reduce works, it can help you implement an algorithm more efficiently than with any other tool. At the same time, if your data and algorithm are really not a 'fit' for MapReduce, you could end up with a highly inefficient MapReduce program.
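As a rough single-process JavaScript analogy (not actual Hadoop code), the phases can be sketched like this, with the combiner pre-aggregating each split's output before the shuffle:

```javascript
// Toy sketch of map -> combine -> sort+shuffle -> reduce over two "splits".
const splits = [['a b a'], ['b a']];

// map: each split emits [word, 1] pairs
const mapped = splits.map((split) =>
  split.flatMap((line) => line.split(' ').map((w) => [w, 1]))
);

// combine: pre-aggregate per split, shrinking the data sent to the shuffle
const combine = (pairs) => {
  const local = {};
  for (const [k, v] of pairs) local[k] = (local[k] || 0) + v;
  return Object.entries(local);
};
const combined = mapped.map(combine);

// sort + shuffle: group all partial counts by key
const groups = {};
for (const [k, v] of combined.flat()) (groups[k] ||= []).push(v);

// reduce: sum the partial counts for each key
const result = Object.fromEntries(
  Object.entries(groups).map(([k, vs]) => [k, vs.reduce((a, b) => a + b, 0)])
);

console.log(result); // { a: 3, b: 2 }
```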
If you wish to research the significance of the mapper in MapReduce, just study the wordcount example program (which comes bundled with MapReduce). Try implementing it with and without the mapper (or the reducer, or MapReduce altogether) and benchmark the performance. I hope you will find the answer.
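As a toy JavaScript analogy of that experiment (not Hadoop itself), here is word count written with and without a separate mapper:

```javascript
const words = 'the quick the lazy the'.split(' ');

// With a separate mapper: map each word to a [word, 1] pair first.
const withMapper = words
  .map((w) => [w, 1])
  .reduce((acc, [w, n]) => { acc[w] = (acc[w] || 0) + n; return acc; }, {});

// Without a mapper: the reducer does everything itself.
const withoutMapper = words
  .reduce((acc, w) => { acc[w] = (acc[w] || 0) + 1; return acc; }, {});

console.log(withMapper);    // { the: 3, quick: 1, lazy: 1 }
console.log(withoutMapper); // same result
```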
Upvotes: 1