Reputation: 151
I am giving an exam on Distributed Systems and I was trying to solve a MapReduce problem from last year's exam. But I am having a hard time figuring out what MR functions I will create. The exercise is about handling a dataset that contains {userID, movieID, timestamp}. We want to built a service that recommends a movie to a user after seiing one. A user(id) has seen the movie(id) in the tuple. To recommend another movie you need to calculate the Jaccard Similarity as such:
Jaccard( X, Y ) = N / (Nx + Ny - N) , where:
The MR functions must be as follows in pseudocode:
MAP(key1, value1):
// Do stuff about<key1,value1>
emit(key2,value2)
REDUCE(key2,list(value2)):
//do stuff about <key2, list(value2)>
emit(key3,value3)
Important: The output of reduce_1 for e.x. must be the input of map_2.
P.S.: It's not a homework as its a past finals exam that's why I din't place it in Homework Questions. (Can give the link to the exam pdf if needed)
I have tried the following for starters:
MAP(key1, value1):
//key = tupleID
// value1 = {userID, movieID, timestamp}
// I discard timestamp as it doesn't offer any help on creating
Jaccard similarity.
emit(movieID,userID)
REDUCE(movieID,list(userID)):
Nx = 0
for each user u in list(userID):
Nx = Nx +1
emit(movieID,Nx)
I don't know what to do next. I also haven't understand the logic behind MR, as to what the second MR will get as input. For example the MovieID will remain the same or it will get the next movieID in the dataset? Thanks in advance for any explanation given. If you want to better explain the datails of the exercise, please ask.
Upvotes: 3
Views: 836
Reputation: 836
The map
part of map/reduce transforms (maps) each input record to a (key -> value) pair.
Input record -> x
Map function -> f
Output of map function on input record -> f(x)
In your specific example this you are transforming the tuple of {userID, movieID, timestamp} to the key-value mapping (movieId -> userId) by discarding the timestamp.
The reduce
part of map/reduce takes each key-value mapping that you created from the above and performs an aggregation function (reduce). Since all the data for one key
needs to be located on one node to perform an accurate aggregation computation, the values for a specific key
is moved to the specific node that is responsible for that key
. That is why the reduce
takes input as (key -> List(value)) or for your example (movieId -> List(userIds)). So yes for each reduce
call, the key
will be different.
The output of the reduce
function will be a unique (key -> aggregation_computation(values)) for every input key. For example in your case,
Movie (Id) Number of Users
Star Wars 50
John Wick 32
Fifty Shades of Grey 9000
...
Upvotes: 1