Sriram Rag

Reputation: 31

Spark Partitions: Loading a file from the local file system on a Single Node Cluster

I am interested in finding out how Spark creates partitions when loading a file from the local file system.

I am using the Databricks Community Edition to learn Spark. When I load a file that is only a few hundred kilobytes in size (about 300 KB) using sc.textFile, Spark by default creates 2 partitions (as reported by partitions.length). When I load a file that is about 500 MB, it creates 8 partitions (which equals the number of cores on the machine).
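For reference, this is roughly what I am running in the notebook (the file paths below are just placeholders for my actual files):

// ~300 KB file on the local file system
val small = sc.textFile("file:///tmp/sample_300kb.txt")
small.partitions.length   // returns 2

// ~500 MB file on the local file system
val large = sc.textFile("file:///tmp/sample_500mb.txt")
large.partitions.length   // returns 8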


What is the logic here?

Also, I learned from the documentation that if we load from the local file system while using a cluster, the file has to be present at the same path on every machine in the cluster. Won't this create duplicates? How does Spark handle this scenario? If you can point to articles that shed light on this, it would be a great help.

Thanks!

Upvotes: 3

Views: 2139

Answers (1)

Lakshman Battini

Reputation: 1912

When Spark reads from the local file system, the default parallelism (defaultParallelism) is the number of available cores.

sc.textFile calculates its default minimum number of partitions as the minimum of defaultParallelism (the available cores, in the local-FS case) and 2:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

Referred from: spark code
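If it helps, both values can be inspected directly on the SparkContext. The numbers below assume an 8-core machine with a local master, matching your case:

// Example values for an 8-core machine running with a local[*] master
sc.defaultParallelism     // 8 -> number of available cores
sc.defaultMinPartitions   // 2 -> math.min(defaultParallelism, 2)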

In the 1st case (file size ~300 KB):

The number of partitions is calculated as 2, because the file is too small to produce more splits than that minimum.

In the 2nd case (file size ~500 MB):

The number of partitions equals defaultParallelism, which is 8 in your case.

When reading from HDFS, sc.textFile takes the maximum of minPartitions and the number of splits Hadoop computes from the input size and the block size.
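If you need more partitions than the default, minPartitions can be passed explicitly as the second argument to textFile (the path and the value below are only illustrative):

// Ask for at least 16 splits; the actual count is the maximum of
// minPartitions and the splits Hadoop derives from input/block sizes
val rdd = sc.textFile("hdfs:///data/large_file.txt", minPartitions = 16)
rdd.partitions.length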

However, when using textFile with compressed files (file.txt.gz, not file.txt or similar), Spark disables splitting, which results in an RDD with only 1 partition (reads against gzipped files cannot be parallelized).
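If you end up with such a single-partition RDD, one common workaround (sketched below with an illustrative file name) is to repartition right after reading, at the cost of a shuffle:

val gz = sc.textFile("file:///tmp/sample.txt.gz")
gz.partitions.length        // 1 -> gzip is not splittable

// Redistribute the data after the single-threaded read
val widened = gz.repartition(8)
widened.partitions.length   // 8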

For your 2nd query, regarding reading data from a local path on a cluster:

The file needs to be available on all the machines in the cluster, because Spark may launch executors on any of them, and each executor reads the file from its own local file system via the file:// scheme.

To avoid copying the file to every machine: if your data already sits on a network file system such as NFS, AFS, or MapR's NFS layer, you can use it as an input by simply specifying a file:// path; Spark will handle it as long as the file system is mounted at the same path on every node. Please refer to: https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html
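For example, if the data sits on an NFS share that is mounted at the same path on the driver and every worker (the mount point below is just an assumption), it can be read directly:

// /mnt/nfs must be mounted at this exact path on every node in the cluster
val rdd = sc.textFile("file:///mnt/nfs/datasets/input.txt")
rdd.count()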

Upvotes: 5
