Neo

Reputation: 4478

How to re-partition pyspark dataframe?

data.rdd.getNumPartitions() # output 2456

Then I do:

data.rdd.repartition(3000)

But:

data.rdd.getNumPartitions() # output is still 2456

How do I change the number of partitions? One approach is to first convert the DataFrame into an RDD, repartition it, and then convert the RDD back into a DataFrame (sketched below), but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks
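For reference, a minimal sketch of the RDD round trip mentioned above (names are illustrative; it assumes an active SparkSession named spark and that data is a DataFrame of Rows):

rdd3000 = data.rdd.repartition(3000)  # repartition the underlying RDD
data = spark.createDataFrame(rdd3000, schema=data.schema)  # rebuild the DataFrame with the same schema
data.rdd.getNumPartitions() # 3000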

Upvotes: 16

Views: 47086

Answers (3)

Giorgos Myrianthous

Reputation: 39830

If you want to increase the number of partitions, you can use repartition():

data = data.repartition(3000)

If you want to decrease the number of partitions, I would advise you to use coalesce(), which avoids a full shuffle:

Useful for running operations more efficiently after filtering down a large dataset.

data = data.coalesce(10)

For more details, check the article How to efficiently re-partition Spark DataFrames.
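A minimal, self-contained sketch of both calls (assuming a local SparkSession; the DataFrame and partition counts are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
data = spark.range(1000000)

data = data.repartition(3000)       # full shuffle into 3000 partitions
print(data.rdd.getNumPartitions())  # 3000

data = data.coalesce(10)            # merges partitions, avoiding a full shuffle
print(data.rdd.getNumPartitions())  # 10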

Upvotes: 8

Ali Payne

Reputation: 159

print(df.rdd.getNumPartitions())
# 1


df.repartition(5)  # returns a new DataFrame; the result is discarded here
print(df.rdd.getNumPartitions())
# 1


df = df.repartition(5)  # assign the result to keep the repartitioned DataFrame
print(df.rdd.getNumPartitions())
# 5

See Spark: The Definitive Guide, Chapter 5: Basic Structured Operations.
ISBN-13: 978-1491912218
ISBN-10: 1491912219

Upvotes: 13

Michel Lemay

Reputation: 2094

You can check the number of partitions:

data.rdd.getNumPartitions()

To change the number of partitions:

newDF = data.repartition(3000)

You can check the number of partitions:

newDF.rdd.getNumPartitions()

Beware that repartitioning triggers a data shuffle, which is expensive. Take a look at coalesce() if you only need to reduce the number of partitions.
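One way to see the difference is in the physical plan (a sketch; the plan text is abbreviated and varies by Spark version). repartition() adds an Exchange (shuffle) step, whereas coalesce() merges partitions without one:

data.repartition(3000).explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(3000)
# ...

data.coalesce(10).explain()
# == Physical Plan ==
# Coalesce 10
# ...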

Upvotes: 24
