Reputation: 2714
I am trying to understand RDD partitioning logic. An RDD is partitioned across nodes, but I want to understand how this partitioning logic works.
I have a VM with 4 cores assigned to it. I created two RDDs, one from a file on HDFS and one from a parallelize operation. The first one got 2 partitions, but the second one got 4 partitions.
I checked the number of blocks allocated to the file: it was 1 block, as the file is very small. But when I created an RDD on that file, it showed 2 partitions. Why is this? I read somewhere that partitioning also depends on the number of cores, which is 4 in my case, but that still does not explain this output.
Can someone help me understand this?
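Roughly what I ran in the spark-shell (the path is just an example; the real file is a small, single-block file on HDFS):

val fileRdd = sc.textFile("/tmp/small.txt")
println(fileRdd.getNumPartitions)   // prints 2, even though the file has 1 block

val parRdd = sc.parallelize(1 to 100)
println(parRdd.getNumPartitions)    // prints 4, one per core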
Upvotes: 1
Views: 608
Reputation: 3939
The full signature of textFile is:
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
With the second argument, minPartitions, you can set the minimum number of partitions you want to get. As you can see, by default it is set to defaultMinPartitions, which in turn is defined as:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
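So if the default is not what you want, you can pass minPartitions explicitly; a quick sketch (the path is just an example):

val rdd = sc.textFile("/tmp/small.txt", minPartitions = 8)
println(rdd.getNumPartitions)   // typically 8 or more for a splittable text file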
The value of defaultParallelism is configured with the spark.default.parallelism setting, which in local mode defaults to your number of cores. That is 4 in your case, so defaultMinPartitions is min(4, 2) = 2. textFile then asks the Hadoop input format for at least that many splits, which is why even your single-block file ends up with 2 partitions.
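You can verify all of this from the shell (assuming a local[4] master, as in your case; the file path is just an example):

println(sc.defaultParallelism)     // 4: number of cores in local mode
println(sc.defaultMinPartitions)   // 2: min(4, 2)
println(sc.parallelize(1 to 100).getNumPartitions)        // 4, from defaultParallelism
println(sc.textFile("/tmp/small.txt").getNumPartitions)   // 2, from defaultMinPartitions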
Upvotes: 2