MapReduce distributed reducer

Question

I just started learning MapReduce. I have a file in which each line contains an actor and a movie he played in. I want to create a file as follows:

actor     movie1, movie2, ..., movieN

i.e. a key-value file in which each actor appears on only one line, together with all of his movies. This is no problem.

Once this file is created, I want to find the actor who played in the most movies, as a second MR job. I read my new file (the output of the previous job) and, in map(), simply replace the list of movies with the number of movies. In my Reducer I just have to compare with the previous result:

if (numberOfRoles.get() < sum) {
    numberOfRoles.set(sum);
    actorWithMostRoles.set(key);
}

where numberOfRoles and actorWithMostRoles are fields of the Reducer class.

This works without any problems.
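For illustration, the counting step in map() can be sketched in plain Java, without the Hadoop classes. This is only a sketch under assumptions: the class and method names are made up, the actor is assumed to be separated from the movie list by a tab, and movie titles are assumed not to contain commas.

```java
class MovieCountDemo {
    // Turns "actor<TAB>movie1, movie2, ..." into "actor<TAB>numberOfMovies".
    // Assumption: a tab separates the actor from the comma-separated movies.
    static String toActorCount(String line) {
        String[] parts = line.split("\t", 2);
        String movies = parts.length > 1 ? parts[1].trim() : "";
        // An empty movie list counts as zero, otherwise count the comma-separated items.
        int count = movies.isEmpty() ? 0 : movies.split(",").length;
        return parts[0] + "\t" + count;
    }

    public static void main(String[] args) {
        System.out.println(toActorCount("actor1\tmovie1, movie2, movie3"));
        System.out.println(toActorCount("actor2\tmovie4, movie5"));
    }
}
```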

My output of jps:

$ jps
32347 Jps
25323 DataNode
25145 NameNode
25541 SecondaryNameNode

I know that there can be multiple Mappers and Reducers, for example Reducer_0 and Reducer_1, each of which will output the actor with the most movies among the actors it has seen. Given the following data:

actor1 movie1, movie2, movie3
actor2 movie4, movie5

Reducer_0 will get actor1 to count and thus output actor1 3, and Reducer_1 will output actor2 2. So I will have two lines instead of one (actor1), because each Reducer has found its own actor.

Having described what I am doing, my question is:

Either I don't understand how it works (with multiple reducers in a cluster), or do I have to do some kind of synchronisation?
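The effect can be reproduced without a cluster. The following is a rough simulation in plain Java, not real Hadoop code: LocalMaxDemo, partition and run are made-up names, and the hash-modulo partitioning only imitates the idea of a hash partitioner. Each simulated reducer keeps its own maximum, exactly like the fields in the Reducer class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LocalMaxDemo {
    // Imitates hash partitioning: a given key always lands on the same reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Every reducer tracks only the best actor among the keys it received
    // and emits one line, so the result has up to numReducers lines.
    static List<String> run(Map<String, Integer> counts, int numReducers) {
        String[] bestActor = new String[numReducers];
        int[] bestCount = new int[numReducers];
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int r = partition(e.getKey(), numReducers);
            if (bestActor[r] == null || bestCount[r] < e.getValue()) {
                bestCount[r] = e.getValue();
                bestActor[r] = e.getKey();
            }
        }
        List<String> output = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            if (bestActor[r] != null) {
                output.add(bestActor[r] + "\t" + bestCount[r]);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("actor1", 3);
        counts.put("actor2", 2);
        System.out.println(run(counts, 2)); // two local maxima, one per reducer
        System.out.println(run(counts, 1)); // a single reducer sees everything
    }
}
```

With two simulated reducers both actors are printed; with one reducer only actor1 is left.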

Aman · Accepted Answer

Yes, you understand how it works.

In this setup, with more than one reducer, you will need another MapReduce job to finish it up for you: it reads the one line each reducer wrote and picks the overall maximum.

Or just use a single reducer and be done with it!
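As a sketch of what that finishing step has to do, here it is in plain Java rather than as a real Hadoop job. GlobalMaxDemo and globalMax are made-up names, and the partial results are assumed to be lines of the form actor, tab, count. In Hadoop itself, the single-reducer variant is configured on the job with job.setNumReduceTasks(1).

```java
import java.util.Arrays;
import java.util.List;

class GlobalMaxDemo {
    // Takes one "actor<TAB>count" line per reducer and returns the line
    // with the highest count, or null if there were no partial results.
    static String globalMax(List<String> partialResults) {
        String bestActor = null;
        int bestCount = -1;
        for (String line : partialResults) {
            String[] parts = line.split("\t");
            int count = Integer.parseInt(parts[1].trim());
            if (count > bestCount) {
                bestCount = count;
                bestActor = parts[0];
            }
        }
        return bestActor == null ? null : bestActor + "\t" + bestCount;
    }

    public static void main(String[] args) {
        // Assumed output of Reducer_0 and Reducer_1 from the question.
        List<String> partial = Arrays.asList("actor1\t3", "actor2\t2");
        System.out.println(globalMax(partial));
    }
}
```

Since this step only ever sees one line per reducer, running it on a single reducer is cheap even when the first pass is large.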
