Shashi

Reputation: 2714

RDD partitioning logic

I am trying to understand RDD partitioning logic. An RDD is partitioned across nodes, but I want to understand how this partitioning logic works.

I have a VM with 4 cores assigned to it. I created two RDDs, one from a file on HDFS and one from a parallelize operation.

[Screenshot of the spark-shell session showing both RDDs and their partition counts.]
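Roughly, the two operations were of this form (the file path and data below are placeholders, not the actual values from the screenshot):

// RDD from a small file on HDFS (the file occupies 1 HDFS block)
val fromFile = sc.textFile("hdfs:///path/to/smallfile.txt")
fromFile.partitions.length   // 2

// RDD from a local collection via parallelize
val fromParallelize = sc.parallelize(1 to 100)
fromParallelize.partitions.length   // 4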

The first time, two partitions were created, but in the second operation four partitions were created.

I checked the number of blocks allocated to the file: it was 1 block, as the file is very small. But when I created an RDD on that file, it showed two partitions. Why is this? I read somewhere that partitioning also depends on the number of cores, which is 4 in my case, but that still does not explain this output.

Can someone help me understand this?

Upvotes: 1

Views: 608

Answers (1)

sgvd

Reputation: 3939

The full signature of textFile is:

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

With the second argument, minPartitions, you can set the minimum number of partitions you want to get. As you can see, it defaults to defaultMinPartitions, which in turn is defined as:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

The value of defaultParallelism is configured with the spark.default.parallelism setting, which in local mode defaults to your number of cores. That is 4 in your case, so you get min(4, 2) = 2, which is why you get 2 partitions.
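You can verify this in spark-shell (this sketch assumes local mode with 4 cores, matching your VM; the file path is a placeholder):

// local mode with 4 cores: defaultParallelism = 4
sc.defaultParallelism     // 4
sc.defaultMinPartitions   // min(4, 2) = 2

// textFile uses defaultMinPartitions unless you pass a value yourself
sc.textFile("hdfs:///path/to/smallfile.txt").partitions.length      // 2
sc.textFile("hdfs:///path/to/smallfile.txt", 4).partitions.length   // at least 4 for a splittable file

// parallelize, by contrast, defaults to defaultParallelism slices
sc.parallelize(1 to 100).partitions.length   // 4

That last line also explains your second observation: parallelize does not use defaultMinPartitions at all; its numSlices argument defaults to defaultParallelism, which is why the RDD created from the collection got 4 partitions.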

Upvotes: 2
