MapReduce Jaccard Similarity Calculation for movie Recommendations

Question

I am giving an exam on Distributed Systems and I was trying to solve a MapReduce problem from last year's exam. But I am having a hard time figuring out what MR functions I will create. The exercise is about handling a dataset that contains {userID, movieID, timestamp}. We want to built a service that recommends a movie to a user after seiing one. A user(id) has seen the movie(id) in the tuple. To recommend another movie you need to calculate the Jaccard Similarity as such:

Jaccard( X, Y ) = N / (Nx + Ny - N) , where:

Nx = Number of users who saw movie X
Ny = Number of users who saw movie Y
N = Number of users who saw both movie X & Y

The MR functions must be as follows in pseudocode:

MAP(key1, value1):
  // Do stuff about
  emit(key2,value2)


REDUCE(key2,list(value2)):
  //do stuff about 
  emit(key3,value3)

Important: The output of reduce_1 for e.x. must be the input of map_2.

P.S.: It's not a homework as its a past finals exam that's why I din't place it in Homework Questions. (Can give the link to the exam pdf if needed)

I have tried the following for starters:

MAP(key1, value1):
  //key = tupleID
  // value1 = {userID, movieID, timestamp}
  // I discard timestamp as it doesn't offer any help on creating 
     Jaccard similarity.
  emit(movieID,userID)


REDUCE(movieID,list(userID)):
  Nx = 0
  for each user u in list(userID):
     Nx = Nx +1
  emit(movieID,Nx)

I don't know what to do next. I also haven't understand the logic behind MR, as to what the second MR will get as input. For example the MovieID will remain the same or it will get the next movieID in the dataset? Thanks in advance for any explanation given. If you want to better explain the datails of the exercise, please ask.

MapReduce Jaccard Similarity Calculation for movie Recommendations

Answers (1)

Related Questions