Reputation: 6206
I have a general question about the MapReduce framework.
I have a task which can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm.
Then, according to the MapReduce framework, it seems that I have two choices:
1. Run the algorithm in the Map stage, so that in the Reduce stage there is no work to be done except collecting the results of each partition from the Map stage and summarizing them.
2. In the Map stage, just divide and send the partitions (with data) to the Reduce stage. In the Reduce stage, run the algorithm first, and then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner, so I may not understand MapReduce very well; I only have basic parallel computing concepts.
Upvotes: 0
Views: 111
Reputation: 860
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
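Each of these is a pluggable Java class. As a taste of what one of them looks like, here is a minimal sketch of a custom Partitioner that does roughly what the default HashPartitioner already does (the class name ByKeyPartitioner is just an illustrative placeholder):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Minimal sketch: send each key to a reducer based on its hash, which is
    // essentially what the default HashPartitioner does anyway.
    public class ByKeyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask the sign bit so the result is a valid partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }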
Hadoop's distributed file system (HDFS) partitions (or splits, you might say) the file into blocks of BLOCK SIZE. These partitioned blocks are placed on different nodes. So, when a job is submitted to the MapReduce framework, it divides that job such that there is a Mapper for every input split (for now, let's say an input split is such a partitioned block). Since these blocks are distributed onto different nodes, these Mappers also run on different nodes.
In the Map stage:
- The RecordReader reads records from the input split; the definition of a record is controlled by the InputFormat that we choose. Every record is a key-value pair.
- The map() of our Mapper is run for every such record. The output of this step is again key-value pairs.
- The output of the Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, it's of no use anyway).
- The intermediate output can optionally be aggregated locally before it is sent to the reducer. You can use a Combiner to do that. Note that the framework does not guarantee the number of times a Combiner will be called; it is only part of optimization.
The map() step is where your algorithm on the data is usually written. Since these tasks run in parallel, the Map stage makes a good candidate for computation-intensive work, as in the sketch below.
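Here is a minimal, hypothetical sketch of that idea for the asker's scenario: the computation-intensive algorithm runs inside map(), one record at a time. The key/value types and the runExpensiveAlgorithm() helper are illustrative placeholders, not part of any real API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch of option 1 from the question: the heavy work
    // happens in map(), which runs in parallel across all Mappers.
    public class PartitionMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Computation-intensive work on this record/partition of the data.
            String result = runExpensiveAlgorithm(record.toString());

            // Emit a key-value pair; the key decides which Reducer receives it.
            context.write(new Text("partition-result"), new Text(result));
        }

        // Placeholder for the asker's computation-intensive algorithm.
        private String runExpensiveAlgorithm(String input) {
            return input.toUpperCase(); // stand-in for real work
        }
    }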
After all the Mappers finish running on all nodes, the intermediate data (i.e., the data at the end of the Map stage) is copied to the corresponding Reducers.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here a record comprises a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job, you usually think about what can be done on each record of data in both the Mapper and the Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split; the framework does that for you. However, you will have to learn about it sooner or later.
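To make that concrete, a minimal driver tying the two sketches above together might look like this (the job name and the input/output paths are illustrative; the framework takes care of splitting and scheduling):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PartitionJobDriver {
        public static void main(String[] args) throws Exception {
            // A MapReduce program really can be just a Mapper, a Reducer and
            // this small amount of wiring; parallelization is the framework's job.
            Job job = Job.getInstance(new Configuration(), "partitioned algorithm");
            job.setJarByClass(PartitionJobDriver.class);

            job.setMapperClass(PartitionMapper.class); // sketched above
            job.setReducerClass(SummaryReducer.class); // sketched above

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }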
I would suggest you go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good and their explanation using the word count program is very nice and intuitive.
Also, regarding:

    I have a task, which can be separated into several partitions. For each partition, I need to run a computing intensive algorithm.

The Hadoop distributed file system has the data split onto multiple nodes, and the MapReduce framework assigns a map task to every split. So, in Hadoop, the processing goes and executes where the data resides. You cannot define the number of map tasks to run; the data (the number of input splits) does. You can, however, specify/control the number of reduce tasks.
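That last point is one extra line in the driver sketched above, using the standard Job API (the value 4 is arbitrary, for illustration only):

    // The number of reduce tasks can be set explicitly...
    job.setNumReduceTasks(4);
    // ...while the number of map tasks follows from the number of input splits.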
I hope I have comprehensively answered your question.
Upvotes: 0
Reputation: 1062
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n nodes. Each of those n nodes receives a fraction of the whole task and does something with its piece. When they have finished computing their steps on the data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D running on a map-reduce cluster m with n nodes under it, each node is mapped roughly a 1/n share of D. Then the cluster m with its n nodes reduces those pieces into a single result. Now, take one of those nodes, q, to itself be a cluster with p nodes under it. If m assigns q its share of D, q can in turn map that share across its p nodes, so each of them works on about 1/(n*p) of D. Then q's nodes can reduce the data back to q, and q can supply its result to its neighbors in m.
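To put illustrative numbers on that: if D has 1,000,000 records and cluster m has n = 10 nodes, each node is mapped about 100,000 records. If one node q is itself a cluster with p = 5 nodes, q maps its 100,000 records as roughly 20,000 per sub-node; those sub-nodes reduce back to q, and q's single result is reduced again alongside its 9 peers at m's level.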
Make sense?
Upvotes: 0