Frederico Oliveira

Reputation: 2473

Spark 2.0 read csv number of partitions (PySpark)

I'm trying to port some code from Spark 1.6 to Spark 2.0 using the new features of Spark 2.0. First, I want to use Spark 2.0's CSV reader. BTW, I'm using PySpark.

With the "old" textFile function, I'm able to set the minimum number of partitions. Ex:

file = sc.textFile('/home/xpto/text.csv', minPartitions=10)
header = file.first()  # extract header
data = file.filter(lambda x: x != header)  # csv without header
...

Now, with Spark 2.0 I can read the csv directly:

df = spark.read.csv('/home/xpto/text.csv', header=True)
...

But I couldn't find a way to set minPartitions.

I need this to test the performance of my code.

Thx, Fred

Upvotes: 6

Views: 20373

Answers (2)

Vijay Krishna

Reputation: 1067

The short answer is no: you can't set a minimum number of partitions with anything like the minPartitions parameter when using the DataFrameReader.

coalesce can be used in this case to reduce the partition count, and repartition can be used to increase it. With coalesce, downstream performance may be better if you force a shuffle via the shuffle flag (especially with skewed data): rdd.coalesce(100, shuffle=True). Note that the shuffle flag exists only on the RDD coalesce, not on DataFrame.coalesce; forcing the shuffle triggers a full shuffle of the data, with a cost comparable to repartition.

Note that these operations generally do not preserve the original order of the file read (the exception being coalesce without the shuffle flag), so if part of your code depends on the dataset's order, you should avoid a shuffle before that point.
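A minimal sketch of the two options, reusing the question's file path (the partition counts are only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark picks the initial partition count based on the file size.
df = spark.read.csv('/home/xpto/text.csv', header=True)
print(df.rdd.getNumPartitions())

# Reduce the partition count (no shuffle, narrow dependency).
df_fewer = df.coalesce(10)

# Increase the partition count (always a full shuffle).
df_more = df.repartition(300)

# The shuffle flag is only available on the RDD coalesce, not on the DataFrame one.
rdd_shuffled = df.rdd.coalesce(100, shuffle=True)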

Upvotes: 7

Frederico Oliveira
Frederico Oliveira

Reputation: 2473

I figured it out. The DataFrame (and RDD) has a method called coalesce, which lets you set the number of partitions.

Ex:

>>> df = spark.read.csv('/home/xpto/text.csv', header=True).coalesce(2)
>>> df.rdd.getNumPartitions()
2

In my case, Spark split my file into 153 partitions. I'm able to set the number of partitions to 10, but when I try to set it to 300, it ignores that and uses 153 again (I don't know why).
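Most likely this is because coalesce can only reduce the number of partitions; going above the 153 that Spark chose would require repartition instead. A quick sketch (partition counts are illustrative):

>>> df = spark.read.csv('/home/xpto/text.csv', header=True)
>>> df.rdd.getNumPartitions()  # whatever Spark chose, e.g. 153
153
>>> df.coalesce(300).rdd.getNumPartitions()  # coalesce cannot increase the count
153
>>> df.repartition(300).rdd.getNumPartitions()  # repartition shuffles and can increase it
300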

REF: https://spark.apache.org/docs/2.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame.coalesce

Upvotes: 2
