user2327621

Reputation: 997

[Scala/Scalding]: map ID to name

I am fairly new to Scalding and am trying to write a Scalding program that takes two datasets as input:

1) book_id_title: ('id, 'title): the mapping between book ID and book title. Both are strings.
2) book_sim: ('id1, 'id2, 'sim): the similarity between pairs of books, identified by their IDs.

The goal of the Scalding program is to replace each (id1, id2) in book_sim with the respective titles by looking them up in the book_id_title table. However, I am not able to retrieve the title. I would appreciate it if someone could help with the getTitle() function below.

My Scalding code is as follows:

  // read in the mapping between book id and title from a csv file
  val book_id_title =
    Csv(book_file, fields = book_format)
      .read
      .project('id, 'title)

  // read in the similarity data from a csv file and map the ids to the titles
  // by calling getTitle function
  val result =
    book_sim
      .map(('id1, 'id2) -> ('title1, 'title2)) {
        pair: (String, String) => (getTitle(pair._1), getTitle(pair._2))
      }
      .write(out)


  // function that searches for the id and retrieves the title
  def getTitle(search_id: String) = {
    val btitle =
      book_id_title
        .filter('id) { id: String => id == search_id } // extract row matching the id
        .project('title) // get the title
  }

thanks

Upvotes: 0

Views: 218

Answers (1)

Sasha O

Reputation: 3749

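Hadoop is a batch processing system, so there is no way to look up individual records by key from inside a map function. In your code, book_id_title is a pipe (a description of a dataflow), not an in-memory table, so filtering and projecting it just yields another pipe rather than a title string. Instead, you need to join book_sim with book_id_title by ID, probably twice: once for the left ID and once for the right ID. Something like: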

book_sim.joinWithSmaller('id1 -> 'id, book_id_title).joinWithSmaller('id2 -> 'id, book_id_title)

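I am not very familiar with the field-based API, so treat the above as pseudocode. You will also need to add appropriate projections, and rename fields so that the second join does not collide with the 'id and 'title fields added by the first. Hopefully it still gives you the idea.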
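
To make the double join a bit more concrete, here is a minimal sketch in the fields-based API. It is only an illustration under some assumptions: the job name BookTitleJoin, the argument names ("books", "sim", "output"), and the renamed fields 'bid1 and 'bid2 are all made up for this example, and the input files are assumed to be header-less CSVs with exactly the columns described in the question.

```scala
import com.twitter.scalding._

// Sketch only: BookTitleJoin, the argument names ("books", "sim", "output")
// and the renamed fields ('bid1, 'bid2) are assumptions for this example.
class BookTitleJoin(args: Args) extends Job(args) {

  // (id, title) lookup table
  val bookIdTitle = Csv(args("books"), fields = ('id, 'title)).read

  // (id1, id2, sim) similarity pairs
  val bookSim = Csv(args("sim"), fields = ('id1, 'id2, 'sim)).read

  // Two renamed copies of the lookup table, so that the fields added by
  // the first join do not collide with those added by the second.
  val titles1 = bookIdTitle.rename(('id, 'title) -> ('bid1, 'title1))
  val titles2 = bookIdTitle.rename(('id, 'title) -> ('bid2, 'title2))

  bookSim
    .joinWithSmaller('id1 -> 'bid1, titles1) // attach the title for id1
    .joinWithSmaller('id2 -> 'bid2, titles2) // attach the title for id2
    .project('title1, 'title2, 'sim)         // keep only titles and similarity
    .write(Csv(args("output")))
}
```

Both joins are inner joins by default, so a pair whose ID is missing from the title table would be dropped from the output. If the title table is small enough to fit in memory on each task, joinWithTiny may be worth trying in place of joinWithSmaller, since it replicates the small side instead of shuffling both sides.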

Upvotes: 1
