MapReduce Jaccard Similarity Calculation for movie Recommendations

I am giving an exam on Distributed Systems and I was trying to solve a MapReduce problem from last year's exam. But I am having a hard time figuring out what MR functions I will create. The exercise is about handling a dataset that contains {userID, movieID, timestamp}. We want to built a service that recommends a movie to a user after seiing one. A user(id) has seen the movie(id) in the tuple. To recommend another movie you need to calculate the Jaccard Similarity as such:

Jaccard( X, Y ) = N / (Nx + Ny - N) , where:

The MR functions must be as follows in pseudocode:

MAP(key1, value1):
  // Do stuff about<key1,value1>
  emit(key2,value2)


REDUCE(key2,list(value2)):
  //do stuff about <key2, list(value2)>
  emit(key3,value3)

Important: The output of reduce_1 for e.x. must be the input of map_2.

P.S.: It's not a homework as its a past finals exam that's why I din't place it in Homework Questions. (Can give the link to the exam pdf if needed)

I have tried the following for starters:

MAP(key1, value1):
  //key = tupleID
  // value1 = {userID, movieID, timestamp}
  // I discard timestamp as it doesn't offer any help on creating 
     Jaccard similarity.
  emit(movieID,userID)


REDUCE(movieID,list(userID)):
  Nx = 0
  for each user u in list(userID):
     Nx = Nx +1
  emit(movieID,Nx)

I don't know what to do next. I also haven't understand the logic behind MR, as to what the second MR will get as input. For example the MovieID will remain the same or it will get the next movieID in the dataset? Thanks in advance for any explanation given. If you want to better explain the datails of the exercise, please ask.

Upvotes: 3

Views: 836

Answers (1)

yeenow123
yeenow123

Reputation: 836

The map part of map/reduce transforms (maps) each input record to a (key -> value) pair.

Input record -> x Map function -> f Output of map function on input record -> f(x)

In your specific example this you are transforming the tuple of {userID, movieID, timestamp} to the key-value mapping (movieId -> userId) by discarding the timestamp.

The reduce part of map/reduce takes each key-value mapping that you created from the above and performs an aggregation function (reduce). Since all the data for one key needs to be located on one node to perform an accurate aggregation computation, the values for a specific key is moved to the specific node that is responsible for that key. That is why the reduce takes input as (key -> List(value)) or for your example (movieId -> List(userIds)). So yes for each reduce call, the key will be different.

The output of the reduce function will be a unique (key -> aggregation_computation(values)) for every input key. For example in your case,

Movie (Id)              Number of Users
Star Wars               50
John Wick               32
Fifty Shades of Grey    9000
...

Upvotes: 1

Related Questions