x4k3p

Reputation: 1779

MapReduce distributed reducer

I just started learning MapReduce. I have a file in which each line contains an actor and one movie he played in. I want to create a file like this:

actor     movie1, movie2, ..., movieN

That is, a key-value file in which each actor appears on exactly one line, together with all of his movies. This part is no problem.

Once this file exists, I want to find the actor who played in the most movies, as a second MR job. I read the new file (the output of the previous job) and, in map(), simply replace the list of movies with the number of movies. In my Reducer I then only have to compare against the best result so far:

if (numberOfRoles.get() < sum) {
    numberOfRoles.set(sum);
    actorWithMostRoles.set(key);
}

where numberOfRoles and actorWithMostRoles are fields of the Reducer class.

This works without any problems.

My output of jps:

$ jps
32347 Jps
25323 DataNode
25145 NameNode
25541 SecondaryNameNode

I know that there can be multiple Mappers and Reducers, for example Reducer_0 and Reducer_1, each of which will output the actor with the most movies among the keys it received. Take the following data:

actor1 movie1, movie2, movie3
actor2 movie4, movie5

Reducer_0 gets actor1 to count and outputs actor1 3, while Reducer_1 outputs actor2 2. So I end up with two lines instead of one (actor1), because each Reducer has found its own "best" actor.
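The effect described above can be reproduced without a cluster. The following is only a plain-Java sketch, not Hadoop code: the class LocalMaxDemo, its methods, and the partition function are hypothetical stand-ins, and the assumption is that keys are spread over reducers by hash, with each reducer keeping its own running maximum.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalMaxDemo {
    // Hypothetical stand-in for a hash partitioner: decides which reducer gets a key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Each simulated reducer remembers only the best (actor, count) it has seen
    // and emits that single record once all of its keys are processed.
    static List<String> run(Map<String, Integer> counts, int numReducers) {
        String[] bestActor = new String[numReducers];
        int[] bestCount = new int[numReducers];
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int r = partition(e.getKey(), numReducers);
            if (bestActor[r] == null || bestCount[r] < e.getValue()) {
                bestCount[r] = e.getValue();
                bestActor[r] = e.getKey();
            }
        }
        List<String> out = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            if (bestActor[r] != null) {
                out.add(bestActor[r] + "\t" + bestCount[r]);
            }
        }
        return out;
    }

    // The two-actor example from the question, already reduced to movie counts.
    static Map<String, Integer> sample() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("actor1", 3);
        counts.put("actor2", 2);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(sample(), 2)); // one line per reducer that received keys
        System.out.println(run(sample(), 1)); // a single reducer sees every key
    }
}
```

With two simulated reducers, each one reports its own winner, so both actors appear. With one reducer only actor1 remains, because the fields are then shared across all keys.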

Having described what I am doing, my question is:

Do I misunderstand how this works with multiple reducers in a cluster, or do I have to synchronise the reducers somehow?

Upvotes: 1

Views: 88

Answers (2)

Ashraful Islam

Reputation: 12840

In the second MR job, read your new file (the output of the previous job)
and change the job as follows.

Mapping phase:
Read each actor and their movie count, and emit every record under the same special key "max", with the pair of actor name and movie count as the value:

output key = "max"  
output value = ("actor", movieCount)

Reducing phase:
Because every record shares the key "max", a single reducer receives all actors and their movie counts as one value list, so just find the maximum movie count in that list:

input key = "max"   
input value = [("actor",movie_count), ("actor",movie_count) ...]   
output key = "most movies played"      
output value = max_value
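A minimal plain-Java sketch of the two phases above, assuming each input line has the form actor<TAB>count. It only imitates the data flow and does not use the Hadoop API; the class SingleKeyMax and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SingleKeyMax {
    // Map + shuffle: every input line is re-emitted under the same key "max",
    // so all values end up grouped in one list.
    static Map<String, List<String>> mapAndShuffle(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent("max", k -> new ArrayList<>()).add(line);
        }
        return grouped;
    }

    // Reduce: a single call sees all values for "max" and scans for the largest count.
    // Assumes a non-empty list, since reduce only runs for keys that have values.
    static String reduce(List<String> values) {
        String bestActor = null;
        int bestCount = -1;
        for (String v : values) {
            String[] parts = v.split("\t");
            int count = Integer.parseInt(parts[1].trim());
            if (count > bestCount) {
                bestCount = count;
                bestActor = parts[0];
            }
        }
        return bestActor + "\t" + bestCount;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("actor1\t3", "actor2\t2");
        Map<String, List<String>> grouped = mapAndShuffle(lines);
        System.out.println(reduce(grouped.get("max")));
    }
}
```

Since only one key exists, only one reduce call does any work, no matter how many reducers the job is configured with. That is the trade-off of this approach: the result is global, but the reduce side is not parallel.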

Upvotes: 0

Aman

Reputation: 8995

Yes, you understand how it works.

With multiple reducers in this setup, you will need another MapReduce job to combine the per-reducer results into one.

Or just use a single reducer and be done with it!

Upvotes: 1
