Sanchay

Reputation: 1113

Implementation of Spark's parallelize collections

I have a list of strings which I convert into an RDD:

JavaRDD<String> stringRDD = jsc.parallelize(strings,5);

As per my understanding, when we do jsc.textFile(filename,5), each slave node will parse its own portion of the file (say, from S3) and store that part of the RDD in its memory.

What will be the behaviour in case of parallelize()? Does the whole list get passed to each slave node?

Upvotes: 0

Views: 223

Answers (1)

philantrovert

Reputation: 10092

In the line:

JavaRDD<String> stringRDD = jsc.parallelize(strings,5);

The second parameter, 5, denotes the number of partitions you want stringRDD to have. If you have 5 workers, each receives one partition to work on and applies whatever operations your code performs.

If your List strings has fewer than 5 elements, at least one partition will be empty, and the worker that receives it will be idle.
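A minimal sketch of how to see this, assuming a local JavaSparkContext set up purely for illustration (the strings and stringRDD names mirror the question). glom() gathers each partition into a list, so empty partitions show up as empty lists:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parallelize-demo").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // 3 elements but 5 requested partitions.
        List<String> strings = Arrays.asList("a", "b", "c");
        JavaRDD<String> stringRDD = jsc.parallelize(strings, 5);

        System.out.println(stringRDD.getNumPartitions()); // 5

        // glom() turns each partition into a List, so empty partitions
        // appear as empty lists in the collected result.
        List<List<String>> perPartition = stringRDD.glom().collect();
        for (int i = 0; i < perPartition.size(); i++) {
            System.out.println("partition " + i + ": " + perPartition.get(i));
        }

        jsc.stop();
    }
}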

each slave node will parse its own portion of the file (say, from S3) and store that part of the RDD in its memory

Each slave node will process its partition, but it won't store the resulting RDD in memory unless you ask it to by calling cache or persist on that RDD. Otherwise the RDD is only computed in memory when an action needs it and is not kept around afterwards.
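A rough sketch of the difference, assuming the jsc and strings from the question are in scope (the upper and cached variable names are just for illustration):

JavaRDD<String> stringRDD = jsc.parallelize(strings, 5);

// Without cache()/persist(): every action recomputes the partitions
// from the parent data.
JavaRDD<String> upper = stringRDD.map(String::toUpperCase);
upper.count();    // partitions computed here
upper.collect();  // and computed again here

// With cache(): the first action materialises the partitions and keeps
// them in executor memory (if they fit), so later actions reuse them.
JavaRDD<String> cached = stringRDD.map(String::toUpperCase).cache();
cached.count();   // partitions computed and stored
cached.collect(); // reuses the cached partitions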

Upvotes: 2
