Yu Chen

Reputation: 7460

How are Spark's default partitions calculated for HadoopPartitions?

I'm reading Jacek Laskowski's online book about Apache Spark, and regarding partitioning, he states that

By default, a partition is created for each HDFS partition, which by default is 64MB

I'm not extremely familiar with HDFS, but I ran into some questions while trying to replicate this statement. I have a file called Reviews.csv, which is a roughly 330MB text file of Amazon food reviews. Given the default 64MB blocks, I'd expect ceiling(330 / 64) = 6 partitions. However, when I load the file into my Spark shell, I get 9 partitions:

scala> val tokenized_logs = sc.textFile("Reviews.csv")
tokenized_logs: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24

scala> tokenized_logs
res0: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24

scala> tokenized_logs.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.HadoopPartition@3c1, org.apache.spark.rdd.HadoopPartition@3c2, org.apache.spark.rdd.HadoopPartition@3c3, org.apache.spark.rdd.HadoopPartition@3c4, org.apache.spark.rdd.HadoopPartition@3c5, org.apache.spark.rdd.HadoopPartition@3c6, org.apache.spark.rdd.HadoopPartition@3c7, org.apache.spark.rdd.HadoopPartition@3c8, org.apache.spark.rdd.HadoopPartition@3c9)

scala> tokenized_logs.partitions.size
res2: Int = 9

I do notice that if I create a smaller version of Reviews.csv, called Reviews_Smaller.csv, that is only 135MB, the number of partitions is significantly reduced:

scala> val raw_reviews = sc.textFile("Reviews_Smaller.csv")
raw_reviews: org.apache.spark.rdd.RDD[String] = Reviews_Smaller.csv MapPartitionsRDD[11] at textFile at <console>:24

scala> raw_reviews.partitions.size
res7: Int = 4

However, by my math, there should be ceiling(135 / 64) = 3 partitions, not 4.

I'm running everything locally, on my MacBook Pro. Can anyone help explain how the number of default partitions is calculated for HDFS?
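
For reference, here are the calls that (as far as I can tell) report the defaults feeding into this calculation; I'm only showing the calls, not the outputs, since the values depend on the machine:

// Parallelism offered by the local master (the number of cores under local[*]).
sc.defaultParallelism

// Lower bound that textFile uses when no minPartitions argument is passed;
// it is defined as math.min(defaultParallelism, 2).
sc.defaultMinPartitions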

Upvotes: 1

Views: 1495

Answers (1)

mazaneicha

Reputation: 9427

From the Spark Programming Guide:

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
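
In other words, minPartitions is only a lower bound. A quick sketch to illustrate (using the file name from your question):

// Default: one partition per Hadoop input split, roughly one per block.
sc.textFile("Reviews.csv").partitions.size

// Asking for a larger minPartitions produces more, smaller splits.
sc.textFile("Reviews.csv", 20).partitions.size    // typically at least 20

// Asking for fewer has no effect: you still get at least one partition per block.
sc.textFile("Reviews.csv", 2).partitions.size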

128MB is only the default HDFS block size; the block size of any particular file can be different. The number of partitions in your case suggests that your file was written with a non-default block size (or, more likely, consists of multiple smaller files).

See this excellent SO answer for ways to determine the number of blocks an HDFS file is split into.
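
For an HDFS file you can also query this directly from the Spark shell via the Hadoop FileSystem API, which is already on the classpath (a sketch; the path is just the one from your question):

import org.apache.hadoop.fs.Path

val path = new Path("Reviews.csv")
val fs = path.getFileSystem(sc.hadoopConfiguration)
val status = fs.getFileStatus(path)

// Block size (in bytes) this particular file was written with.
status.getBlockSize

// Number of blocks the file is actually split into.
fs.getFileBlockLocations(status, 0, status.getLen).length

On a real cluster, hdfs fsck <path> -files -blocks reports the same information from the command line.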

Upvotes: 4
