Ayan

Reputation: 411

How is the number of partitions decided by Spark when a file is read?

Suppose we have a single 10 GB file in an HDFS directory, and multiple part files totalling 10 GB at another HDFS location.

If these two inputs are read into two separate Spark DataFrames, how many partitions would each have, and based on what logic?

Upvotes: 3

Views: 7193

Answers (1)

Anand Sai

Reputation: 1586

Found the information in "How to: determine partition". It says:

How is this number determined? The way Spark groups RDDs into stages is described in the previous post. (As a quick reminder, transformations like repartition and reduceByKey induce stage boundaries.) The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage. The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple of exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product.

What about RDDs with no parents? RDDs produced by textFile or hadoopFile have their partitions determined by the underlying MapReduce InputFormat that’s used. Typically there will be a partition for each HDFS block being read. Partitions for RDDs produced by parallelize come from the parameter given by the user, or spark.default.parallelism if none is given.
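These rules are easy to check interactively. A minimal PySpark sketch (the partition counts 4 and 3 are arbitrary choices for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

a = sc.parallelize(range(100), 4)            # explicit numSlices -> 4 partitions
b = sc.parallelize(range(100), 3)            # explicit numSlices -> 3 partitions

print(a.getNumPartitions())                  # 4
print(b.getNumPartitions())                  # 3
print(a.union(b).getNumPartitions())         # 4 + 3 = 7  (sum of the parents)
print(a.cartesian(b).getNumPartitions())     # 4 * 3 = 12 (product of the parents)
print(a.coalesce(2).getNumPartitions())      # 2          (fewer than the parent)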

When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. For instance, if you use textFile() it would be TextInputFormat in Hadoop, which returns a single partition for a single HDFS block (though the split between partitions is made on line boundaries, not at the exact block boundary), unless you have a compressed text file. For a compressed file you get a single partition for the whole file, as compressed text files are not splittable.
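For example, reading the same data uncompressed and gzipped shows the difference (the HDFS paths below are hypothetical):

plain = sc.textFile("hdfs:///data/big_file.txt")         # hypothetical path
print(plain.getNumPartitions())    # roughly file size / HDFS block size (one per input split)

gzipped = sc.textFile("hdfs:///data/big_file.txt.gz")    # hypothetical path
print(gzipped.getNumPartitions())  # 1 -- gzip is not splittable, so the whole file is one partition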

If you have a 10 GB uncompressed text file stored on HDFS, then with the default HDFS block size (128 MB) it would be stored in 80 blocks, which means the RDD you read from this file would have 80 partitions.
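The arithmetic behind that number is just the file size divided by the block size, rounded up:

import math

file_size_mb  = 10 * 1024   # 10 GB expressed in MB
block_size_mb = 128         # default HDFS block size
print(math.ceil(file_size_mb / block_size_mb))   # 80 blocks -> ~80 partitions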

Also, we can pass the number of partitions we want if we are not satisfied with Spark's default, as shown below:

>>> rdd1 = sc.textFile("statePopulations.csv", 10)  # 10 is the minimum number of partitions (minPartitions)
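Since the question mentions DataFrames: for the DataFrame reader the split size is governed by spark.sql.files.maxPartitionBytes (128 MB by default) rather than only the HDFS block size, but the outcome for a 10 GB input is similar. A rough sketch, assuming a SparkSession named spark and hypothetical paths:

df_single = spark.read.text("hdfs:///data/big_10gb_file.txt")    # hypothetical path
print(df_single.rdd.getNumPartitions())   # about 10 GB / 128 MB -> ~80 partitions

df_parts = spark.read.text("hdfs:///data/part_files_dir/")       # hypothetical path
print(df_parts.rdd.getNumPartitions())    # part files are split the same way; very small files
                                          # may be packed together into fewer partitions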

Upvotes: 2
