Aaron O'Donnell

Reputation: 91

In Apache Spark, how can I group all the rows of an RDD by two shared values?

I have an RDD of a custom case class whose rows are of the form

{userId:"h245hv45uh", title: "The-BFG", seen: 1, timestamp: 2016-08-06 13:19:53.051000+0000}

Is there any way I can group all rows which have the same userId and title, and then create a single row in a new RDD with the same userId and title but with all the 'seen' values added?

{userId:"h245hv45uh", title: "The-BFG", seen: 71, timestamp: 2016-08-06 13:19:53.051000+0000}

like the example above, if there were 71 rows with the same userId and title?

The original RDD contains many different titles and user IDs, and I'm trying to aggregate the 'seen' count across rows with matching userIds and titles.
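For context, here is a minimal sketch of what such a case class might look like (the class and field names below are assumptions inferred from the example row, not the actual code):

case class View(userId: String, title: String, seen: Int, timestamp: String)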

Thanks

Upvotes: 3

Views: 920

Answers (1)

Daniel de Paula

Reputation: 17872

You can try converting it into a Pair RDD and then using reduceByKey:

def combFunc(cc1: CaseClass, cc2: CaseClass): CaseClass = {
  // sums the 'seen' counts; the other fields (e.g. timestamp) are kept
  // from cc1, which is an arbitrary record within the group
  cc1.copy(seen = cc1.seen + cc2.seen)
}

val newRDD = rdd
  .map( i => ((i.userId, i.title), i) ) // converting into a PairRDD keyed by (userId, title)
  .reduceByKey(combFunc) // merging all records that share a key
  .values // converting back to an RDD[CaseClass]
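
For reference, here is a minimal self-contained sketch of the same approach (the case class, object name, and sample data below are assumptions for illustration, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class matching the shape described in the question.
case class View(userId: String, title: String, seen: Int, timestamp: String)

object SeenAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seen-aggregation").setMaster("local[*]"))

    // Made-up sample data with the shape from the question.
    val rdd = sc.parallelize(Seq(
      View("h245hv45uh", "The-BFG", 1, "2016-08-06 13:19:53.051000+0000"),
      View("h245hv45uh", "The-BFG", 1, "2016-08-06 14:02:11.000000+0000"),
      View("abc123",     "The-BFG", 1, "2016-08-07 09:15:00.000000+0000")
    ))

    val aggregated = rdd
      .map(v => ((v.userId, v.title), v))                    // key by (userId, title)
      .reduceByKey((a, b) => a.copy(seen = a.seen + b.seen)) // sum the 'seen' counts
      .values                                                // back to RDD[View]

    aggregated.collect().foreach(println)
    // prints one View per (userId, title) pair, with 'seen' summed

    sc.stop()
  }
}

Note that reduceByKey combines values on each partition before shuffling, so it is generally preferable to groupByKey for this kind of aggregation.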

Upvotes: 2
