Aaron O'Donnell

Reputation: 91

In Apache Spark, how can I group all the rows of an RDD by two shared values?

I have an RDD of a custom case class whose rows are of the form

{userId:"h245hv45uh", title: "The-BFG", seen: 1, timestamp: 2016-08-06 13:19:53.051000+0000}

Is there any way I can group all rows which have the same userId and title, and then create a single row in a new RDD with the same userId and title but with all the 'seen' values added?

{userId:"h245hv45uh", title: "The-BFG", seen: 71, timestamp: 2016-08-06 13:19:53.051000+0000}

like the example above, if there were 71 rows with the same userId and title?

The original RDD contains many different titles and user IDs, and I'm trying to aggregate the 'seen' count across rows with matching userIds and titles.
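For context, here is a minimal sketch of what such a case class might look like (the class and field names below are assumptions inferred from the example row, not the actual code):

case class View(userId: String, title: String, seen: Int, timestamp: String)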

Thanks

Upvotes: 3

Views: 920

Answers (1)

Daniel de Paula

Reputation: 17872

You can try converting it into a Pair RDD and then using reduceByKey:

def combFunc(cc1: CaseClass, cc2: CaseClass): CaseClass = {
  // sums the 'seen' counts; the other fields (e.g. timestamp) are kept
  // from cc1, which is an arbitrary record within the group
  cc1.copy(seen = cc1.seen + cc2.seen)
}

val newRDD = rdd
  .map( i => ((i.userId, i.title), i) ) // converting into a PairRDD keyed by (userId, title)
  .reduceByKey(combFunc) // merging all records that share a key
  .values // converting back to an RDD[CaseClass]
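
For reference, here is a minimal self-contained sketch of the same approach (the case class, object name, and sample data below are assumptions for illustration, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class matching the shape described in the question.
case class View(userId: String, title: String, seen: Int, timestamp: String)

object SeenAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seen-aggregation").setMaster("local[*]"))

    // Made-up sample data with the shape from the question.
    val rdd = sc.parallelize(Seq(
      View("h245hv45uh", "The-BFG", 1, "2016-08-06 13:19:53.051000+0000"),
      View("h245hv45uh", "The-BFG", 1, "2016-08-06 14:02:11.000000+0000"),
      View("abc123",     "The-BFG", 1, "2016-08-07 09:15:00.000000+0000")
    ))

    val aggregated = rdd
      .map(v => ((v.userId, v.title), v))                    // key by (userId, title)
      .reduceByKey((a, b) => a.copy(seen = a.seen + b.seen)) // sum the 'seen' counts
      .values                                                // back to RDD[View]

    aggregated.collect().foreach(println)
    // prints one View per (userId, title) pair, with 'seen' summed

    sc.stop()
  }
}

Note that reduceByKey combines values on each partition before shuffling, so it is generally preferable to groupByKey for this kind of aggregation.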

Upvotes: 2
