Reputation: 28022
I am reading a text file using the following command in PySpark
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv")
Is there a way to specify the number of partitions that RDD rating_data_raw should be split into? I want to specify a large number of partitions for greater concurrency.
Upvotes: 3
Views: 7139
Reputation: 11
It is also possible to read a .csv file as a DataFrame and then control the partitioning after converting it to an RDD. I leave an example structure below.
# assumes an active SparkSession `spark` and its SparkContext `sc`
dataset = spark.read.csv("data.csv", header=True, inferSchema=True)
colsDrop = ("data_index", "_c17", "song_title", "artist")
df = dataset.drop(*colsDrop)
# partitionBy() works on key-value pair RDDs, so key each row first;
# note that collect() pulls all rows to the driver before re-distributing them
rdd = sc.parallelize(df.collect()).keyBy(lambda row: row[0]).partitionBy(8)
Here .partitionBy() lets you control the number of partitions of the resulting RDD. You can check the current partition count with the .getNumPartitions() method.
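For instance, a quick check on the rdd built above (a minimal sketch, assuming the same spark/sc session):
print(rdd.getNumPartitions())      # prints 8 for the RDD partitioned above
print(df.rdd.getNumPartitions())   # the partition count Spark chose for the DataFrame itself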
The only thing to note is that, when running locally, using more partitions than the number of CPU threads will not give any further speed gain.
For example, my CPU has 8 threads, and in a sample timing comparison I ran, execution time stopped improving beyond 8 partitions.
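A minimal sketch of how such a timing comparison could be run locally, reusing df and sc from above (the partition counts are arbitrary):
import time

rows = df.collect()  # pull the data to the driver once for this local experiment
for n in (2, 4, 8, 16, 32):
    start = time.time()
    sc.parallelize(rows, n).count()  # force a full pass over the data with n partitions
    print(n, "partitions:", round(time.time() - start, 3), "s")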
Upvotes: 0
Reputation: 18022
As another user said, you can set the minimum number of partitions created while reading the file through the optional minPartitions parameter of textFile.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=128)
Another way to achieve this is by using repartition or coalesce: if you need to reduce the number of partitions, use coalesce; otherwise, use repartition.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(128)
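And for the reverse direction, a small sketch of shrinking the partition count with coalesce (the target of 16 is arbitrary):
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=128)
reduced = rating_data_raw.coalesce(16)  # merge down to 16 partitions without a full shuffle
print(reduced.getNumPartitions())       # -> 16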
Upvotes: 5