London guy
London guy

Reputation: 28022

Is there a way to control the number of partitions when reading a text file in PySpark

I am reading a text file using the following command in PySpark

rating_data_raw = sc.textFile("/<path_to_csv_file>.csv")

Is there a way to specify the number of partitions that RDD rating_data_raw should be split into? I want to specify a large number of partitions for greater concurrency.

Upvotes: 3

Views: 7139

Answers (2)

Fatih Kahraman
Fatih Kahraman

Reputation: 11

It is also possible to read a .csv file and then check partitions with df to RDD conversion. I leave an example structure below.

dataset = spark.read.csv("data.csv", header=True, inferSchema='True')
colsDrop = ("data_index", "_c17", "song_title", "artist")
df = dataset.drop(*colsDrop)
rdd = sc.parallelize(df.collect()).partitionBy(8)

Here .partitionBy() allows you to control the partition number of an RDD object. It is also possible to find out these numbers with the .getNumPartition() method.

The only thing to note is that giving the number of partitions more than the number of threads on the CPU will not give us speed gain.

For example, the amount of threads in my CPU is 8, you can see a sample time distribution below.

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

As you can see, I can't gain speed after 8 partitions.

Upvotes: 0

Alberto Bonsanto
Alberto Bonsanto

Reputation: 18022

As other user said you can set the minimal number of partitions that will be created, while reading the file, by setting it in the optional parameter minPartitions of textFile.

rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=128)

Another way to achieve this is by using repartition or coalesce, if you need to reduce the number of partition you may use coalesce, otherwise you can use repartition.

rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(128)

Upvotes: 5

Related Questions