Reputation: 4478
data.rdd.getNumPartitions() # output 2456
Then I do
data.rdd.repartition(3000)
But
data.rdd.getNumPartitions()
# output is still 2456
How do I change the number of partitions? One approach would be to first convert the DataFrame to an RDD, repartition it, and then convert the RDD back to a DataFrame, but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks
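i.e. something like this (where spark is my SparkSession):
rdd = data.rdd.repartition(3000)               # returns a new RDD
data = spark.createDataFrame(rdd, data.schema) # rebuild the DataFrame
data.rdd.getNumPartitions()  # 3000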
Upvotes: 16
Views: 47086
Reputation: 39830
If you want to increase the number of partitions, you can use repartition():
data = data.repartition(3000)
If you want to decrease the number of partitions, I would advise you to use coalesce(), which avoids a full shuffle:
Useful for running operations more efficiently after filtering down a large dataset.
data = data.coalesce(10)
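For example, here is a rough sketch of the pattern that quote describes (the status column is made up for illustration):
filtered = data.filter(data["status"] == "active")  # heavy filter leaves many near-empty partitions
filtered = filtered.coalesce(10)                    # merge them without a full shuffle
print(filtered.rdd.getNumPartitions())  # 10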
For more details, check the article How to efficiently re-partition Spark DataFrames.
Upvotes: 8
Reputation: 159
DataFrames are immutable, so repartition() returns a new DataFrame instead of modifying the one you call it on; you have to assign the result:
print(df.rdd.getNumPartitions())  # 1
df.repartition(5)                 # result is discarded, df is unchanged
print(df.rdd.getNumPartitions())  # 1
df = df.repartition(5)            # assign the returned DataFrame
print(df.rdd.getNumPartitions())  # 5
See Spark: The Definitive Guide, Chapter 5: Basic Structured Operations.
ISBN-13: 978-1491912218
ISBN-10: 1491912219
Upvotes: 13
Reputation: 2094
You can check the number of partitions:
data.rdd.partitions.size  # Scala; in PySpark use data.rdd.getNumPartitions()
To change the number of partitions:
newDF = data.repartition(3000)
Then check the number of partitions again:
newDF.rdd.partitions.size
Beware: repartitioning triggers a full data shuffle, which is expensive. Take a look at coalesce() if you only need to reduce the partition count.
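One caveat (a minimal PySpark sketch; the counts are illustrative): coalesce() never shuffles, so it can only reduce the partition count, and asking for more partitions than you currently have is a no-op:
small = newDF.coalesce(10)
print(small.rdd.getNumPartitions())  # 10
bigger = small.coalesce(5000)        # no shuffle, so the count cannot grow
print(bigger.rdd.getNumPartitions()) # still 10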
Upvotes: 24