Frederico Oliveira

Reputation: 2473

Spark 2.0 read csv number of partitions (PySpark)

I'm trying to port some code from Spark 1.6 to Spark 2.0 using the new features of Spark 2.0. First, I want to use Spark 2.0's CSV reader. BTW, I'm using PySpark.

With the "old" textFile function, I'm able to set the minimum number of partitions. Ex:

file = sc.textFile('/home/xpto/text.csv', minPartitions=10)
header = file.first()  # extract header
data = file.filter(lambda x: x != header)  # csv without header
...

Now, with Spark 2.0 I can read the csv directly:

df = spark.read.csv('/home/xpto/text.csv', header=True)
...

But I couldn't find a way to set minPartitions.

I need this to test the performance of my code.

Thx, Fred

Upvotes: 6

Views: 20373

Answers (2)

Vijay Krishna

Reputation: 1067

The short answer is no: you can't set a minimum number of partitions with anything like the minPartitions parameter when using the DataFrameReader.

coalesce can be used in this case to reduce the partition count, and repartition can be used to increase it. With coalesce, downstream performance may be better if you force a shuffle via the shuffle flag (especially with skewed data): rdd.coalesce(100, shuffle=True). Note that the shuffle flag exists only on the RDD coalesce, not on DataFrame.coalesce; forcing the shuffle triggers a full shuffle of the data, with a cost comparable to repartition.

Note that these operations generally do not preserve the original order of the file read (the exception being coalesce without the shuffle flag), so if part of your code depends on the dataset's order, you should avoid a shuffle before that point.
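A minimal sketch of the two options, reusing the question's file path (the partition counts are only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark picks the initial partition count based on the file size.
df = spark.read.csv('/home/xpto/text.csv', header=True)
print(df.rdd.getNumPartitions())

# Reduce the partition count (no shuffle, narrow dependency).
df_fewer = df.coalesce(10)

# Increase the partition count (always a full shuffle).
df_more = df.repartition(300)

# The shuffle flag is only available on the RDD coalesce, not on the DataFrame one.
rdd_shuffled = df.rdd.coalesce(100, shuffle=True)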

Upvotes: 7

Frederico Oliveira
Frederico Oliveira

Reputation: 2473

I figured it out. The DataFrame (and RDD) has a method called coalesce, which lets you set the number of partitions.

Ex:

>>> df = spark.read.csv('/home/xpto/text.csv', header=True).coalesce(2)
>>> df.rdd.getNumPartitions()
2

In my case, Spark split my file into 153 partitions. I'm able to set the number of partitions to 10, but when I try to set it to 300, it ignores that and uses 153 again (I don't know why).
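Most likely this is because coalesce can only reduce the number of partitions; going above the 153 that Spark chose would require repartition instead. A quick sketch (partition counts are illustrative):

>>> df = spark.read.csv('/home/xpto/text.csv', header=True)
>>> df.rdd.getNumPartitions()  # whatever Spark chose, e.g. 153
153
>>> df.coalesce(300).rdd.getNumPartitions()  # coalesce cannot increase the count
153
>>> df.repartition(300).rdd.getNumPartitions()  # repartition shuffles and can increase it
300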

REF: https://spark.apache.org/docs/2.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame.coalesce

Upvotes: 2
