Reputation: 31546
Recently I was asked (in a class assignment) to find the top 10 occurring words inside RDD. I submitted my assignment with a working solution which looks like
wordsRdd
.map(x => (x, 1))
.reduceByKey(_ + _)
.map(case (x, y) => (y, x))
.sortByKey(false)
.map(case (x, y) => (y, x))
.take(10)
So basically, I swap the tuple, sort by key, and then swap again. Then finally take 10. I don't find the repeated swapping very elegant.
So I wonder if there is a more elegant way of doing this.
I searched and found some people using Scala implicits
to convert the RDD into a Scala Sequence and then doing the sortByValue
, but I don't want to convert RDD to a Scala Seq
, because that will kill the distributed nature of the RDD.
So is there a better way?
Upvotes: 3
Views: 1564
Reputation: 330083
How about this:
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(Ordering.by(-1 * _._2))
or a little bit more verbose:
object WordCountPairsOrdering extends Ordering[(String, Int)] {
def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(WordCountPairsOrdering)
Upvotes: 3