Reputation: 7800
Currently when I want to generate data in Spark I do something like this:
//generates an array list of integers 0...999
final List<Integer> range = range(1000);
JavaRDD<Data> rdd = sc
    .parallelize(range)
    .mapPartitionsWithIndex(generateData(), false);
For a large enough range (500 million elements, for example) I run out of memory.
How can I get around this problem?
Upvotes: 2
Views: 1786
Reputation: 97
Sorry if this answer is outdated; maybe it will still be helpful for others.
@Hyun Joon Kim is correct, but I just want to add a simpler option that doesn't use MLlib. Here it is:
sc.parallelize(0 to 1000000)
It returns an RDD[Int].
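Since the question uses the Java API, a rough equivalent might look like the sketch below (illustrative only; like the Scala one-liner, it still builds the range on the driver with a Java stream before parallelizing, so it only helps for ranges that fit in driver memory):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = ...
// Build 0..999999 on the driver, then distribute it across the cluster.
List<Integer> range = IntStream.range(0, 1000000).boxed().collect(Collectors.toList());
JavaRDD<Integer> ints = jsc.parallelize(range);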
Upvotes: 0
Reputation: 56
The main reason you are running out of memory is that you are generating the data on the driver machine and then parallelizing it out to the other machines.
final List<Integer> range = range(1000);
This line generates a List of Integers that lives entirely in the memory of a single machine. (Note that this is plain Java code; you are not using the Spark API to generate the data.) This is not desirable, because what you probably want is to generate data that exceeds the memory of a single machine.
So what you need to do is tell each Spark worker node to generate the random data by itself.
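Without MLlib, one way to do that is to parallelize only a small list of partition indexes and have each task expand its index into records on the worker that runs it. A minimal sketch of the idea (numPartitions, recordsPerPartition, and the simple Random-based generator are illustrative assumptions, not the original generateData()):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

JavaSparkContext jsc = ...
int numPartitions = 100;                 // how many chunks to generate in parallel
long recordsPerPartition = 5_000_000L;   // 100 * 5M = 500M records in total

// The driver only ever holds this tiny list of partition indexes.
List<Integer> seeds = new ArrayList<>();
for (int i = 0; i < numPartitions; i++) {
    seeds.add(i);
}

// Each task turns its single index into recordsPerPartition random values,
// so the bulk of the data is created on the workers, never on the driver.
Function2<Integer, Iterator<Integer>, Iterator<Integer>> generate = (index, it) -> {
    Random rnd = new Random(index);
    List<Integer> out = new ArrayList<>();
    for (long j = 0; j < recordsPerPartition; j++) {
        out.add(rnd.nextInt());
    }
    return out.iterator();
};

JavaRDD<Integer> data = jsc.parallelize(seeds, numPartitions)
    .mapPartitionsWithIndex(generate, false);

Each partition's data still has to fit in a single task's memory; for even larger partitions you could return a lazy Iterator instead of building an ArrayList.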
If you just want random test data, Spark MLlib has nice functionality you can use. (The code below is adapted from the MLlib documentation.)
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...
// Generate 1 million i.i.d. values from the standard normal distribution,
// evenly spread across 10 partitions; the values are created on the workers.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
The output RDD now contains 1 million Double values drawn from the standard normal distribution, distributed across 10 partitions.
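If you need a different distribution or your own record type, you can transform those values on the workers as well; for example, the MLlib documentation shifts them to N(1, 4) like this:

// Apply a transform to get values following N(1, 4) instead of N(0, 1).
JavaDoubleRDD v = u.mapToDouble(x -> 1.0 + 2.0 * x);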
Upvotes: 4