Reputation: 3609
Let's say I am reading a file from HDFS using Spark (Scala), and the HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in the base RDD:
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided based on the block size?
In the above case, is the number of partitions 3?
Upvotes: 1
Views: 442
Reputation: 965
There is a good article that describes the partition computation logic for input data.
The HDFS block size is the default maximum size of a partition, so in your example the minimum number of partitions will be 3:

partitions = ceiling(input size / block size) = ceiling(130 MB / 64 MB) = 3
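You can check this in the spark-shell; a minimal sketch, reusing the file path from your question (on older Spark versions use distFile.partitions.length instead of getNumPartitions):

// Base RDD read from the 130 MB file, with a 64 MB HDFS block size
val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
// Should print 3 for a 130 MB file split into 64 MB blocks
println(distFile.getNumPartitions)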
You can further increase the number of partitions by passing a minimum partition count as the second argument to sc.textFile, as in sc.textFile(inputPath, minPartitions).
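For example, a small sketch requesting more partitions (the value 10 here is just an illustration):

// minPartitions is a hint for the minimum number of partitions, not an exact count
val moreParts = sc.textFile("hdfs://user/cloudera/data.txt", 10)
println(moreParts.getNumPartitions)  // at least 10 for this file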
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role: increasing it increases the split (partition) size and therefore reduces the number of partitions. So if you set mapreduce.input.fileinputformat.split.minsize to, say, 130 MB, you will get only 1 partition.
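A minimal sketch of setting that property through the SparkContext's Hadoop configuration before reading the file (exact behaviour can depend on your Hadoop/Spark versions):

// Raise the minimum split size to 130 MB (in bytes) so the whole file fits in one split
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (130L * 1024 * 1024).toString)
val onePart = sc.textFile("hdfs://user/cloudera/data.txt")
println(onePart.getNumPartitions)  // expected: 1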
Upvotes: 3