Reputation: 91
I have an RDD of a custom case object which is of the form
{userId:"h245hv45uh", title: "The-BFG", seen: 1, timestamp: 2016-08-06 13:19:53.051000+0000}
Is there any way I can group all rows which have the same userId and title, and then create a single row in a new RDD with the same userId and title but with all the 'seen' values added?
{userId:"h245hv45uh", title: "The-BFG", seen: 71, timestamp: 2016-08-06 13:19:53.051000+0000}
like that ^ if there were 71 rows which had the same userId and title?
The original RDD has several titles and user IDs and I'm trying to aggregate the score, filtering for matching userIds and titles
Thanks
Upvotes: 3
Views: 920
Reputation: 17872
You can try converting it into a Pair RDD then using reduceByKey
:
def combFunc(cc1: CaseClass, cc2: CaseClass): CaseClass = {
cc1.copy(seen = cc1.seen + cc2.seen)
}
val newRDD = rdd
.map( i => ((i.userId, i.title), i) ) // converting into a PairRDD
.reduceByKey(combFunc) // reducing by key
.values // converting back to an RDD[CaseClass]
Upvotes: 2