Sanchay

Reputation: 1113

Implementation of Spark's parallelize collections

I have a list of strings which I convert into an RDD:

JavaRDD<String> stringRDD = jsc.parallelize(strings,5);

As per my understanding, when we do jsc.textFile(filename,5), each slave node will parse its own portion of the file (say, from S3) and store that part of the RDD in its memory.

What will be the behaviour in case of parallelize()? Does the whole list get passed to each slave node?

Upvotes: 0

Views: 223

Answers (1)

philantrovert

Reputation: 10092

In the line:

JavaRDD<String> stringRDD = jsc.parallelize(strings,5);

The second parameter, 5, denotes the number of partitions you want stringRDD to have. If you have 5 workers, each receives one partition to work on and applies whatever operations your code performs.

If your List strings has fewer than 5 elements, at least one partition will be empty, and the worker that receives it will be idle.
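A minimal sketch of how to see this, assuming a local JavaSparkContext set up purely for illustration (the strings and stringRDD names mirror the question). glom() gathers each partition into a list, so empty partitions show up as empty lists:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parallelize-demo").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // 3 elements but 5 requested partitions.
        List<String> strings = Arrays.asList("a", "b", "c");
        JavaRDD<String> stringRDD = jsc.parallelize(strings, 5);

        System.out.println(stringRDD.getNumPartitions()); // 5

        // glom() turns each partition into a List, so empty partitions
        // appear as empty lists in the collected result.
        List<List<String>> perPartition = stringRDD.glom().collect();
        for (int i = 0; i < perPartition.size(); i++) {
            System.out.println("partition " + i + ": " + perPartition.get(i));
        }

        jsc.stop();
    }
}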

each slave node will parse its own portion of the file (say, from S3) and store that part of the RDD in its memory

Each slave node will process its partition, but it won't store the resulting RDD in memory unless you ask it to by calling cache or persist on that RDD. Otherwise the RDD is only computed in memory when an action needs it and is not kept around afterwards.
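A rough sketch of the difference, assuming the jsc and strings from the question are in scope (the upper and cached variable names are just for illustration):

JavaRDD<String> stringRDD = jsc.parallelize(strings, 5);

// Without cache()/persist(): every action recomputes the partitions
// from the parent data.
JavaRDD<String> upper = stringRDD.map(String::toUpperCase);
upper.count();    // partitions computed here
upper.collect();  // and computed again here

// With cache(): the first action materialises the partitions and keeps
// them in executor memory (if they fit), so later actions reuse them.
JavaRDD<String> cached = stringRDD.map(String::toUpperCase).cache();
cached.count();   // partitions computed and stored
cached.collect(); // reuses the cached partitions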

Upvotes: 2
