Reputation: 1779
I just started learning MapReduce. I have a file in which each line contains an actor and a movie he played in. I want to create a file as follows:
actor movie1, movie2, ..., movieN
i.e. a key-value file in which each actor appears on exactly one line, together with all his movies. This is no problem.
Once this file is created, I want to find the actor who played in the most movies, as a second MR job. I read my new file (the output of the previous job) and, in map(), simply replace the list of movies with the number of movies. In my Reducer I then just have to compare against the best result so far:
if (numberOfRoles.get() < sum) {
    numberOfRoles.set(sum);
    actorWithMostRoles.set(key);
}
where numberOfRoles and actorWithMostRoles are fields of the Reducer class.
This works without any problems.
My output of jps:
$ jps
32347 Jps
25323 DataNode
25145 NameNode
25541 SecondaryNameNode
I know that there can be multiple Mappers and Reducers, for example Reducer_0 and Reducer_1, each of which will output the actor with the most movies among the keys it receives. Given the following data:
actor1 movie1, movie2, movie3
actor2 movie4, movie5
Reducer_0 will get actor1 and thus output "actor1 3", and Reducer_1 will output "actor2 2". So I will have two lines instead of one (actor1), because each Reducer has found its own top actor.
Having described what I am doing, here is my question:
Either I don't understand how this works with multiple reducers in a cluster, or I have to do some kind of synchronisation?
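The situation described above can be reproduced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class PerReducerMaxSketch and its methods are hypothetical names, and String.hashCode() stands in for the key hash (Hadoop's Text type hashes differently, so the real assignment of actors to reducers may differ, and both actors could also land on the same reducer). It only illustrates that each reducer sees its own partition of the keys and therefore reports its own local maximum.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerReducerMaxSketch {
    // Hash-partitioner style formula: non-negative hash modulo the reducer count.
    // Assumption: String.hashCode() is used here purely for illustration.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Every simulated reducer keeps a running maximum over the keys routed to it
    // and emits one "actor<TAB>count" line at the end.
    static List<String> run(Map<String, Integer> counts, int numReducers) {
        String[] bestActor = new String[numReducers];
        int[] bestCount = new int[numReducers];
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int r = partition(e.getKey(), numReducers);
            if (bestActor[r] == null || e.getValue() > bestCount[r]) {
                bestActor[r] = e.getKey();
                bestCount[r] = e.getValue();
            }
        }
        List<String> out = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            if (bestActor[r] != null) {
                out.add(bestActor[r] + "\t" + bestCount[r]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("actor1", 3);
        counts.put("actor2", 2);
        System.out.println(run(counts, 2)); // one line per reducer that received keys
        System.out.println(run(counts, 1)); // a single reducer sees every key
    }
}
```

With two simulated reducers the two actors end up in different partitions in this sketch, so two lines come out; with one reducer only the global winner is emitted.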
Upvotes: 1
Views: 88
Reputation: 12840
In the second MR job, read your new file (the output of the previous job)
and change your map and reduce steps as follows.
Mapping phase:
Read each actor and their movie count, and emit it under one special key, "max", with the pair of actor name and movie count as the value:
output key = "max"
output value = ("actor", movieCount)
Reducing phase:
Because every record shares the same key, a single reducer receives all actors and their movie counts as one value list, so just find the maximum movie count in that list:
input key = "max"
input value = [("actor",movie_count), ("actor",movie_count) ...]
output key = "most movies played"
output value = max_value
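The two phases above can be sketched in plain Java, without the Hadoop API. This is only an illustration under assumptions: the class SingleKeyMaxSketch, the helper ActorCount, and the run method are hypothetical names, and the input is assumed to be "actor<TAB>movie1, movie2, ..." (a tab between key and value, movies separated by commas).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleKeyMaxSketch {
    static final String MAX_KEY = "max";

    // Value type carried under the single key: an actor and a movie count.
    static final class ActorCount {
        final String actor;
        final int count;
        ActorCount(String actor, int count) {
            this.actor = actor;
            this.count = count;
        }
    }

    // Map step: turn "actor<TAB>movie1, movie2" into (actor, 2).
    // Assumption: tab between actor and movie list, commas between movies.
    static ActorCount map(String line) {
        String[] parts = line.split("\t", 2);
        int count = 0;
        if (parts.length == 2 && !parts[1].trim().isEmpty()) {
            count = parts[1].split(",").length;
        }
        return new ActorCount(parts[0], count);
    }

    // Reduce step: all values arrive under one key, so one call sees every actor.
    static String reduce(String key, List<ActorCount> values) {
        ActorCount best = null;
        for (ActorCount v : values) {
            if (best == null || v.count > best.count) {
                best = v;
            }
        }
        return best == null ? null : best.actor + "\t" + best.count;
    }

    // Simulated shuffle: group every mapped value under the constant key.
    static String run(List<String> lines) {
        Map<String, List<ActorCount>> shuffled = new HashMap<>();
        for (String line : lines) {
            shuffled.computeIfAbsent(MAX_KEY, k -> new ArrayList<>()).add(map(line));
        }
        return reduce(MAX_KEY, shuffled.getOrDefault(MAX_KEY, new ArrayList<>()));
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("actor1\tmovie1, movie2, movie3");
        lines.add("actor2\tmovie4, movie5");
        System.out.println(run(lines));
    }
}
```

The trade-off of this design is that the reduce side is no longer parallel, since every record is routed to the same key. Because taking a maximum is associative and commutative, a combiner that forwards only each mapper's local maximum would likely cut down the amount of data shuffled.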
Upvotes: 0
Reputation: 8995
Yes, you understand how it works.
In this setup you will need another MapReduce job to combine the per-reducer results into one.
Or just use a single reducer and be done with it!
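To illustrate the first option, here is a minimal plain-Java sketch of what the follow-up step has to compute: it takes the one-line results written by each reducer and keeps the overall winner. The class MergePartialMaxSketch and the method mergeMax are hypothetical names, and the "actor<TAB>count" line format is an assumption. For the second option, the reducer count of a Hadoop job can be set with Job.setNumReduceTasks(1), which makes this merge step unnecessary.

```java
import java.util.Arrays;
import java.util.List;

public class MergePartialMaxSketch {
    // Each input line is one reducer's local winner, formatted "actor<TAB>count".
    // Returns the line with the highest count, or null if there is no input.
    static String mergeMax(List<String> partialResults) {
        String bestLine = null;
        int bestCount = Integer.MIN_VALUE;
        for (String line : partialResults) {
            String[] parts = line.split("\t");
            int count = Integer.parseInt(parts[1].trim());
            if (bestLine == null || count > bestCount) {
                bestCount = count;
                bestLine = parts[0] + "\t" + count;
            }
        }
        return bestLine;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of Reducer_0 and Reducer_1 from the question.
        List<String> partial = Arrays.asList("actor1\t3", "actor2\t2");
        System.out.println(mergeMax(partial));
    }
}
```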
Upvotes: 1