Reputation: 966
I am reading in an input file using PySpark and I'm wondering what's the best way to repartition the input data so it can be spread out evenly across the Mesos cluster.
Currently, I'm doing:
rdd = sc.textFile('filename').repartition(10)
I was looking at the SparkContext documentation and noticed that the textFile method has an option called minPartitions, which is set to None by default.
I'm wondering if it will be more efficient if I specify my partition value there. For example:
rdd = sc.textFile('filename', 10)
I'm assuming/hoping this will eliminate the need for a shuffle after the data has been read in, since the file would be read in chunks to begin with.
Do I understand it correctly? If not, what is the difference between the two methods (if any)?
Upvotes: 0
Views: 983
Reputation: 330063
There are two main differences between these methods:
- repartition shuffles the data after loading, while minPartitions doesn't.
- repartition results in an exact number of partitions, while minPartitions provides only a lower bound (see Why does partition parameter of SparkContext.textFile not take effect?).

In general, if you load data using textFile, there should be no need to repartition it further to get a roughly uniform distribution. Since input splits are computed based on the amount of data, all partitions should already be more or less the same size. So the only reason to further modify the number of partitions is to improve utilization of resources like memory or CPU cores.
Upvotes: 1