tamjd1

Reputation: 966

What is the most efficient way to partition an input file in pyspark?

I am reading in an input file using PySpark and I'm wondering what's the best way to repartition the input data so it can be spread out evenly across the Mesos cluster.

Currently, I'm doing:

rdd = sc.textFile('filename').repartition(10)

I was looking at the SparkContext documentation and noticed that the textFile method has an optional parameter called minPartitions, which defaults to None.

I'm wondering if it will be more efficient if I specify my partition value there. For example:

rdd = sc.textFile('filename', 10)

I'm assuming/hoping this would eliminate the need for a shuffle after the data has been read in, since the file would be read in the right number of chunks to begin with.

Do I understand it correctly? If not, what is the difference between the two methods (if any)?

Upvotes: 0

Views: 983

Answers (1)

zero323

Reputation: 330063

There are two main differences between these methods:

- repartition is applied after the data has already been loaded, so it shuffles the existing partitions over the network. The minPartitions argument of textFile is used when the input splits are computed, so no additional shuffle is required.
- repartition gives you exactly the number of partitions you ask for, while minPartitions is only a hint (a suggested minimum) passed to the underlying Hadoop input format; the actual number of partitions depends on the input splits, and non-splittable files (for example gzip) still end up in a single partition.
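A quick way to see the first difference is to inspect the RDD lineage. This is just a minimal sketch; 'filename' is the placeholder path from the question, and the lineage strings will vary by Spark version:

    from pyspark import SparkContext

    # Skip this if you already have an sc, e.g. in the pyspark shell
    sc = SparkContext(appName="partition-check")

    # minPartitions is applied while the input splits are computed - no shuffle stage
    rdd_min = sc.textFile('filename', 10)
    print(rdd_min.toDebugString())  # lineage shows only the HadoopRDD and a map stage

    # repartition runs after the data is loaded - it adds a shuffle
    rdd_rep = sc.textFile('filename').repartition(10)
    print(rdd_rep.toDebugString())  # lineage now includes a ShuffledRDD / CoalescedRDD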

In general, if you load data using textFile, there should be no need to repartition it further to get a roughly uniform distribution. Since input splits are computed based on the amount of data, all partitions should already be more or less the same size, so the only reason to modify the number of partitions further is to improve the utilization of resources like memory or CPU cores.
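If you want to check this on your own data, you can count the records per partition without collecting the records themselves (a small sketch using the same sc and placeholder path as above; the numbers in the comments are only illustrative):

    # One count per partition
    sizes = (sc.textFile('filename', 10)
               .mapPartitions(lambda it: [sum(1 for _ in it)])
               .collect())

    print(sizes)       # e.g. [10240, 10198, 10305, ...] - roughly even
    print(len(sizes))  # actual number of partitions; minPartitions is only a lower bound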

Upvotes: 1
