Knows Not Much

Reputation: 31546

SortByValue for an RDD of tuples

Recently I was asked (in a class assignment) to find the 10 most frequent words in an RDD. I submitted my assignment with a working solution, which looks like this:

wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }
  .sortByKey(false)
  .map { case (x, y) => (y, x) }
  .take(10)

Basically, I swap each tuple so the count becomes the key, sort by key in descending order, swap back, and finally take 10. I don't find the repeated swapping very elegant.

So I wonder if there is a more elegant way of doing this.

I searched and found some people using Scala implicits to convert the RDD into a Scala Seq and then doing the sortByValue there, but I don't want to convert the RDD to a Scala Seq, because that would lose the distributed nature of the RDD.

So is there a better way?

Upvotes: 3

Views: 1564

Answers (1)

zero323

Reputation: 330083

How about this:

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(Ordering.by(-1 * _._2))

or a little bit more verbose:

object WordCountPairsOrdering extends Ordering[(String, Int)] {
    // Compare b to a (not a to b) so that larger counts sort first.
    def compare(a: (String, Int), b: (String, Int)): Int = b._2.compare(a._2)
}

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(WordCountPairsOrdering)
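Both versions work because takeOrdered returns the first n elements under the ordering it is given, so an ordering that puts larger counts first yields the most frequent words. To see how such an ordering behaves without a cluster, here is a minimal sketch on a plain Scala collection. The object name OrderingSketch and the sample data are made up for illustration, and the `.reverse` variant is an alternative I am adding here, not something from the code above.

```scala
object OrderingSketch {
  // Hypothetical sample data with the same shape as the (word, count)
  // pairs produced by reduceByKey.
  val counts: Seq[(String, Int)] = Seq(("spark", 3), ("rdd", 1), ("scala", 2))

  // Descending by count: negate the count so larger counts sort first.
  val byCountDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](pair => -pair._2)

  // Same result without negation: order by count, then reverse the ordering.
  val byCountDescAlt: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  def main(args: Array[String]): Unit = {
    // Local stand-in for takeOrdered(2)(ordering): sort, then take the first 2.
    println(counts.sorted(byCountDesc).take(2))
    println(counts.sorted(byCountDescAlt).take(2))
  }
}
```

Either ordering value should be usable as the second argument to takeOrdered on an RDD of (String, Int) pairs. The reversed form also avoids the negation, which would misbehave for Int.MinValue, although word counts never reach that value.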

Upvotes: 3
