Reputation: 5351
I am looking for a Spark RDD operation like top
or takeOrdered
, but that returns another RDD, not an Array, that is, does not collect the full result to RAM.
It can be a sequence of operations, but ideally, in no step trying to collect the full result into the memory of a single node.
Upvotes: 3
Views: 315
Reputation: 27455
Let's say you want to have the top 50% of an RDD.
def top50(rdd: RDD[(Double, String)]) = {
val sorted = rdd.sortByKey(ascending = false)
val partitions = sorted.partitions.size
// Throw away the contents of the lower partitions.
sorted.mapPartitionsWithIndex { (pid, it) =>
if (pid <= partitions / 2) it else Nil
}
}
This is an approximation — you may get more or less than 50%. You could do better but it would cost an extra evaluation of the RDD. For the use cases I have in mind this would not be worth it.
Upvotes: 2
Reputation: 6186
Take a look at
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
val rdd: RDD[(String, Int)] // the first string is the key, the rest is the value
val topByKey:RDD[(String, Array[Int])] = rdd.topByKey(n)
Or use aggregate
with BoundedPriorityQueue
.
Upvotes: 0