Benjamin

Reputation: 3477

Spark: need an RDD.take with a big argument, where the result stays an RDD

Is there an RDD method like take that does not pull all the elements into memory? For example, I may need to take 10^9 elements of my RDD and keep the result as an RDD. What is the best way to do that?

EDIT: A solution could be to zipWithIndex and filter with index < aBigValue, but I am pretty sure there is a better solution.

EDIT 2: The code will be like

sc.parallelize(1 to 100, 2).zipWithIndex().filter(_._2 < 10).map(_._1)

It is a lot of operations just to reduce the size of an RDD :-(

Upvotes: 0

Views: 279

Answers (2)

Benjamin

Reputation: 3477

A solution:

yourRDD.zipWithIndex().filter(_._2 < ExactNumberOfElements).map(_._1)

If you want an approximation, take GameOfThrows' solution.
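For reference, a minimal spark-shell sketch of the exact-count version (the RDD contents and the cutoff of 150 are just illustrative):

// zipWithIndex runs one extra job to compute per-partition offsets,
// then the filter keeps exactly the first 150 elements, still as an RDD
val data = sc.parallelize(1 to 1000, 4)
val firstN = data.zipWithIndex().filter(_._2 < 150).map(_._1)
firstN.count   // res0: Long = 150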

Upvotes: 0

GameOfThrows

Reputation: 4510

I actually quite liked the zipWithIndex + filter mechanism, but if you are looking for an alternative that is sometimes much faster, I would suggest the sample function as described here: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html

data.count
...
res1: Long = 1000
val result = data.sample(false, 0.1, System.currentTimeMillis().toInt)
result.count
...
res2: Long = 100

Sample takes the whole RDD, subsets it by a fraction, and returns the result as another RDD. The problem is that if you are looking for exactly 150 samples from 127112310274 data rows, good luck writing that fraction parameter (you could try 150.0 / data.count). But if you are roughly looking for a tenth of your data, this function works much faster than your take/drop or zip-and-filter approach.
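As a rough sketch of that idea, you can derive the fraction from a target count at runtime (the names target, total, and approx are illustrative; since sample does Bernoulli sampling per row, the result is only approximately the target size):

val target = 150L
val total = data.count()                       // one full pass over the RDD
val fraction = target.toDouble / total
val approx = data.sample(false, fraction, 42L) // withReplacement = false, fixed seed
approx.count                                   // roughly 150, not exactly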

Upvotes: 1
