Reputation: 1113
I have a list of strings which I convert into an RDD:
JavaRDD<String> stringRDD = jsc.parallelize(strings,5);
As per my understanding, when we do jsc.textFile(filename,5), each slave node parses its own portion of the file (say, from S3) and stores the RDD in its memory.
What will the behaviour be in the case of parallelize()? Does the whole list get passed to each slave node?
Upvotes: 0
Views: 223
Reputation: 10092
In the line:
JavaRDD<String> stringRDD = jsc.parallelize(strings,5);
The second parameter, 5, denotes the number of partitions you want to create for stringRDD. If you have 5 workers, each will typically receive one partition and run on it whatever operations your code applies.
If your List strings has fewer than 5 elements, then at least one partition will be empty, and the worker that receives it will have nothing to do.
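To make the empty-partition case concrete, here is a minimal plain-Java sketch of splitting a list into a fixed number of contiguous slices. It is not Spark code: PartitionSketch and slice are hypothetical names, and the evenly spaced index boundaries are an assumption about how parallelize() divides an ordinary list, so treat it as an illustration rather than Spark's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not Spark code.
class PartitionSketch {

    // Splits `items` into `numSlices` contiguous slices using evenly spaced
    // index boundaries: slice i covers [i*n/numSlices, (i+1)*n/numSlices).
    // Assumption: this approximates how parallelize() divides a list.
    static <T> List<List<T>> slice(List<T> items, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be at least 1");
        }
        List<List<T>> slices = new ArrayList<>();
        long n = items.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((i * n) / numSlices);
            int end = (int) (((i + 1) * n) / numSlices);
            // Copy the sub-range so each slice is an independent list.
            slices.add(new ArrayList<>(items.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> strings = Arrays.asList("a", "b", "c");
        List<List<String>> slices = slice(strings, 5);
        for (int i = 0; i < slices.size(); i++) {
            System.out.println("partition " + i + ": " + slices.get(i));
        }
    }
}
```

With three elements and five slices, two of the slices in this sketch come out empty. On a real cluster you can inspect the actual layout with stringRDD.glom().collect(), which returns the contents of each partition as a separate list.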
"then each slave node will parse their individual portions (say from S3), and store the RDD on their memory"
Each slave node will parse its partition, but it won't keep the resulting RDD in memory unless you ask it to by calling cache or persist on that RDD. Without that, the partitions are computed in memory when an action needs them and are recomputed if a later action needs them again.
Upvotes: 2