Reputation: 3609
Let's say I am reading a file from HDFS using Spark (Scala), and the HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in the base RDD:
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided based on the block size?
In the above case, is the number of partitions 3?
Upvotes: 1
Views: 442
Reputation: 965
There is a good article that describes the partition computation logic for input data.
The HDFS block size is the default maximum size of a partition, so in your example the minimum number of partitions will be 3:

partitions = ceiling(input size / block size) = ceiling(130 MB / 64 MB) = 3
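You can check this in the spark-shell; a minimal sketch, reusing the file path from your question (on older Spark versions use distFile.partitions.length instead of getNumPartitions):

// Base RDD read from the 130 MB file, with a 64 MB HDFS block size
val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
// Should print 3 for a 130 MB file split into 64 MB blocks
println(distFile.getNumPartitions)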
You can further increase the number of partitions by passing a minimum partition count as the second argument to sc.textFile, as in sc.textFile(inputPath, minPartitions).
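For example, a small sketch requesting more partitions (the value 10 here is just an illustration):

// minPartitions is a hint for the minimum number of partitions, not an exact count
val moreParts = sc.textFile("hdfs://user/cloudera/data.txt", 10)
println(moreParts.getNumPartitions)  // at least 10 for this file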
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role: increasing it increases the split (partition) size and therefore reduces the number of partitions. So if you set mapreduce.input.fileinputformat.split.minsize to, say, 130 MB, you will get only 1 partition.
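A minimal sketch of setting that property through the SparkContext's Hadoop configuration before reading the file (exact behaviour can depend on your Hadoop/Spark versions):

// Raise the minimum split size to 130 MB (in bytes) so the whole file fits in one split
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (130L * 1024 * 1024).toString)
val onePart = sc.textFile("hdfs://user/cloudera/data.txt")
println(onePart.getNumPartitions)  // expected: 1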
Upvotes: 3