Little Bobby Tables

Reputation: 5351

Spark RDD operation like top returning a smaller RDD

I am looking for a Spark RDD operation like top or takeOrdered, but one that returns another RDD rather than an Array, i.e. one that does not collect the full result into RAM.

It can be a sequence of operations, but ideally no step should collect the full result into the memory of a single node.
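For concreteness, this is the contrast I mean; the RDD-returning call below is hypothetical and only shows the desired shape:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???                       // an existing context
val rdd: RDD[Int] = sc.parallelize(1 to 1000000)

// What exists today: top/takeOrdered bring the n results back to the driver.
val onDriver: Array[Int] = rdd.takeOrdered(1000)

// What I am after (no such built-in; hypothetical name):
// the same top-n result, but left distributed as an RDD.
// val distributed: RDD[Int] = rdd.takeOrderedAsRDD(1000)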

Upvotes: 3

Views: 315

Answers (2)

Daniel Darabos

Reputation: 27455

Let's say you want to have the top 50% of an RDD.

import org.apache.spark.rdd.RDD

def top50(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  val sorted = rdd.sortByKey(ascending = false)
  val partitions = sorted.partitions.size
  // After the descending sort, the first partitions hold the largest keys.
  // Throw away the contents of the later (lower-valued) partitions.
  sorted.mapPartitionsWithIndex { (pid, it) =>
    if (pid <= partitions / 2) it else Iterator.empty
  }
}

This is an approximation — you may get more or less than 50%. You could do better but it would cost an extra evaluation of the RDD. For the use cases I have in mind this would not be worth it.
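For what it's worth, an exact variant with that extra evaluation could look roughly like this (the helper name is just for illustration): one extra job counts the records in each partition, then a second pass trims at the cutoff.

import org.apache.spark.rdd.RDD

def top50Exact(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  val sorted = rdd.sortByKey(ascending = false)
  // Extra evaluation of the RDD: count the records in each partition.
  val counts = sorted
    .mapPartitionsWithIndex { (pid, it) => Iterator((pid, it.size)) }
    .collect()
    .sortBy(_._1)
    .map(_._2)
  val wanted = counts.sum / 2
  // offsets(pid) = number of records in the partitions before pid.
  val offsets = counts.scanLeft(0)(_ + _)
  // Second pass: keep whole partitions before the cutoff, trim the partition
  // that straddles it, and drop the later partitions entirely.
  sorted.mapPartitionsWithIndex { (pid, it) =>
    val before = offsets(pid)
    if (before >= wanted) Iterator.empty
    else it.take(wanted - before)
  }
}

The only thing shipped to the driver is one count per partition, so nothing close to the full result is collected.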

Upvotes: 2

emesday

Reputation: 6186

Take a look at MLPairRDDFunctions:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, Int)] = ??? // String is the key, Int is the value
val n = 10                        // how many top values to keep per key

val topByKey: RDD[(String, Array[Int])] = rdd.topByKey(n)

Or use aggregate with BoundedPriorityQueue.
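Spark's own org.apache.spark.util.BoundedPriorityQueue is package-private, so a rough sketch of that aggregate pattern uses a plain scala.collection.mutable.PriorityQueue capped by hand (topN and n are illustrative names). Note that, like top, this still returns an Array on the driver rather than an RDD:

import scala.collection.mutable

import org.apache.spark.rdd.RDD

def topN(rdd: RDD[Int], n: Int): Array[Int] = {
  // The reverse ordering puts the smallest retained element at the head,
  // so it can be evicted cheaply whenever a queue grows past n elements.
  val zero = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  val queue = rdd.aggregate(zero)(
    // Merge one element into a partition's queue, then re-cap it.
    (q, x) => { q += x; while (q.size > n) q.dequeue(); q },
    // Merge two partitions' queues, then re-cap the result.
    (q1, q2) => { q1 ++= q2; while (q1.size > n) q1.dequeue(); q1 }
  )
  queue.dequeueAll.reverse.toArray // largest first
}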

Upvotes: 0
