Reputation: 7800
Currently when I want to generate data in Spark I do something like this:
//generates an array list of integers 0...999
final List<Integer> range = range(1000);
JavaRDD<Data> rdd = sc
    .parallelize(range)
    .mapPartitionsWithIndex(generateData(), false);
For a large enough range (500 million elements, for example) I run out of memory.
How can I get around this problem?
Upvotes: 2
Views: 1786
Reputation: 97
Sorry if this answer is outdated; maybe it will still be helpful for others.
@Hyun Joon Kim is correct, but I just want to add a simpler option that doesn't use MLlib. Here it is:
sc.parallelize(0 to 1000000)
It returns an RDD[Int].
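Since the question uses the Java API, a rough equivalent might look like the sketch below (illustrative only; like the Scala one-liner, it still builds the range on the driver with a Java stream before parallelizing, so it only helps for ranges that fit in driver memory):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = ...
// Build 0..999999 on the driver, then distribute it across the cluster.
List<Integer> range = IntStream.range(0, 1000000).boxed().collect(Collectors.toList());
JavaRDD<Integer> ints = jsc.parallelize(range);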
Upvotes: 0
Reputation: 56
The main reason you are running out of memory is that you are generating the data on the driver machine and then parallelizing it out to the other machines.
final List<Integer> range = range(1000);
This line generates a List of Integers that lives entirely in the memory of a single machine. (Note that this is plain Java code; you are not using the Spark API to generate the data.) This is not desirable, because what you probably want is to generate data that exceeds the memory of a single machine.
So what you need to do is tell each Spark worker node to generate the random data by itself.
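Without MLlib, one way to do that is to parallelize only a small list of partition indexes and have each task expand its index into records on the worker that runs it. A minimal sketch of the idea (numPartitions, recordsPerPartition, and the simple Random-based generator are illustrative assumptions, not the original generateData()):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

JavaSparkContext jsc = ...
int numPartitions = 100;                 // how many chunks to generate in parallel
long recordsPerPartition = 5_000_000L;   // 100 * 5M = 500M records in total

// The driver only ever holds this tiny list of partition indexes.
List<Integer> seeds = new ArrayList<>();
for (int i = 0; i < numPartitions; i++) {
    seeds.add(i);
}

// Each task turns its single index into recordsPerPartition random values,
// so the bulk of the data is created on the workers, never on the driver.
Function2<Integer, Iterator<Integer>, Iterator<Integer>> generate = (index, it) -> {
    Random rnd = new Random(index);
    List<Integer> out = new ArrayList<>();
    for (long j = 0; j < recordsPerPartition; j++) {
        out.add(rnd.nextInt());
    }
    return out.iterator();
};

JavaRDD<Integer> data = jsc.parallelize(seeds, numPartitions)
    .mapPartitionsWithIndex(generate, false);

Each partition's data still has to fit in a single task's memory; for even larger partitions you could return a lazy Iterator instead of building an ArrayList.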
If you just want random test data, Spark MLlib has nice functionality you can use. (The code below is adapted from the MLlib documentation.)
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...
// Generate 1 million i.i.d. values from the standard normal distribution,
// evenly spread across 10 partitions; the values are created on the workers.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
The output RDD now contains 1 million Double values drawn from the standard normal distribution, distributed across 10 partitions.
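If you need a different distribution or your own record type, you can transform those values on the workers as well; for example, the MLlib documentation shifts them to N(1, 4) like this:

// Apply a transform to get values following N(1, 4) instead of N(0, 1).
JavaDoubleRDD v = u.mapToDouble(x -> 1.0 + 2.0 * x);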
Upvotes: 4