Reputation: 3477
Is there an RDD method like take that does not pull all the elements into memory? For example, I may need to take 10^9 elements from my RDD and keep the result as an RDD. What is the best way to do that?
EDIT: A solution could be to zipWithIndex and filter with index < aBigValue, but I am pretty sure there is a better solution.
EDIT 2: The code would look like this:
sc.parallelize(1 to 100, 2).zipWithIndex().filter(_._2 < 10).map(_._1)
That is a lot of operations just to reduce the size of an RDD :-(
Upvotes: 0
Views: 279
Reputation: 3477
A solution:
yourRDD.zipWithIndex().filter(_._2 < ExactNumberOfElements).map(_._1)
If you want an approximation, see GameOfThrows's solution.
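For reference, a minimal self-contained sketch of the exact-count approach (the names n and data are placeholders for illustration, not from the original answer):

// Keep only the first n elements of an RDD while keeping the result as an RDD.
// zipWithIndex assigns a Long index to each element; note that it triggers a
// Spark job to compute partition sizes when the RDD has more than one partition.
val n = 1000000L
val data = sc.parallelize(1 to 10000000, 8)
val firstN = data.zipWithIndex()                  // (element, index) pairs
  .filter { case (_, idx) => idx < n }            // keep indices below the cutoff
  .map { case (elem, _) => elem }                 // drop the index again
firstN.count()                                    // res: Long = 1000000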
Upvotes: 0
Reputation: 4510
I actually quite like the zipWithIndex + filter mechanism, but if you are looking for an alternative that is sometimes much faster, I would suggest the sample function described here: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html
data.count
...
res1: Long = 1000
val result = data.sample(false, 0.1, System.currentTimeMillis().toInt)
result.count
...
res2: Long = 100
sample takes the whole RDD, subsets it by a fraction, and returns the result as another RDD. The problem is that if you are looking for exactly 150 samples out of 127112310274 data rows, good luck writing that fraction parameter (you could try something like 150.0 / data.count). But if you are looking for roughly a tenth of your data, this function works much faster than take/drop or zipWithIndex and filter.
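To make the fraction point concrete, here is a hedged sketch of deriving it from a desired sample size (desiredCount is a placeholder name; the resulting size is only approximate because sampling is probabilistic):

// Derive the fraction from a desired sample size, then sample without replacement.
val desiredCount = 150L
val total = data.count()                            // one full pass over the RDD
val fraction = desiredCount.toDouble / total        // e.g. 150 / 127112310274
val approxSample = data.sample(false, fraction, System.currentTimeMillis())
approxSample.count()                                // approximately desiredCount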
Upvotes: 1