Reputation: 28022
I am reading a text file using the following command in PySpark
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv")
Is there a way to specify the number of partitions that RDD rating_data_raw should be split into? I want to specify a large number of partitions for greater concurrency.
Upvotes: 3
Views: 7139
Reputation: 11
It is also possible to read a .csv file as a DataFrame and then control the partitioning after converting it to an RDD. I leave an example structure below.
# assumes an active SparkSession `spark` and its SparkContext `sc`
dataset = spark.read.csv("data.csv", header=True, inferSchema=True)
colsDrop = ("data_index", "_c17", "song_title", "artist")
df = dataset.drop(*colsDrop)
# partitionBy() works on key-value pair RDDs, so key each row first;
# note that collect() pulls all rows to the driver before re-distributing them
rdd = sc.parallelize(df.collect()).keyBy(lambda row: row[0]).partitionBy(8)
Here .partitionBy() lets you control the number of partitions of the resulting RDD. You can check the current partition count with the .getNumPartitions() method.
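For instance, a quick check on the rdd built above (a minimal sketch, assuming the same spark/sc session):
print(rdd.getNumPartitions())      # prints 8 for the RDD partitioned above
print(df.rdd.getNumPartitions())   # the partition count Spark chose for the DataFrame itself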
The only thing to note is that, when running locally, using more partitions than the number of CPU threads will not give any further speed gain.
For example, my CPU has 8 threads, and in a sample timing comparison I ran, execution time stopped improving beyond 8 partitions.
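A minimal sketch of how such a timing comparison could be run locally, reusing df and sc from above (the partition counts are arbitrary):
import time

rows = df.collect()  # pull the data to the driver once for this local experiment
for n in (2, 4, 8, 16, 32):
    start = time.time()
    sc.parallelize(rows, n).count()  # force a full pass over the data with n partitions
    print(n, "partitions:", round(time.time() - start, 3), "s")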
Upvotes: 0
Reputation: 18022
As another user said, you can set the minimum number of partitions created while reading the file through the optional minPartitions parameter of textFile.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=128)
Another way to achieve this is by using repartition or coalesce: if you need to reduce the number of partitions, use coalesce; otherwise, use repartition.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(128)
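And for the reverse direction, a small sketch of shrinking the partition count with coalesce (the target of 16 is arbitrary):
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=128)
reduced = rating_data_raw.coalesce(16)  # merge down to 16 partitions without a full shuffle
print(reduced.getNumPartitions())       # -> 16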
Upvotes: 5