Kevin Tianyu Xu

Reputation: 734

Limit number of partitions for spark.read pyspark

After I read xml files using spark:

df = spark.read\
        .format("xml")\
        .options(**options)\
        .load("s3a://.../.../")

I checked the number of partitions with df.rdd.getNumPartitions(), and got 20081.

How do I limit the number of partitions at the start so I don't have to do a coalesce() later? The issue with having so many partitions is that each partition creates one file during df.write, and writing 20081 very small new files to S3 each time this process runs is very bad practice.
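For reference, this is roughly the coalesce()-before-write step I want to avoid (a minimal sketch; the target partition count, output format, and output path are placeholders):

# Current workaround: collapse the 20081 partitions right before writing.
# 20 is an arbitrary target; the output format and path are placeholders.
df.coalesce(20) \
    .write \
    .mode("overwrite") \
    .parquet("s3a://bucket/output/")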

Upvotes: 3

Views: 2420

Answers (2)

pltc

Reputation: 6082

The number of partitions is calculated by DataSourceScanExec using a somewhat complex formula. To simplify: try increasing spark.sql.files.maxPartitionBytes, which defaults to 134217728 (128 MB). Make it larger and you will see the difference.

spark.conf.set('spark.sql.files.maxPartitionBytes', '1073741824') # 1 GB
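A minimal sketch of applying this setting before the read from the question and checking the effect (the S3 path is the elided one from the question):

spark.conf.set("spark.sql.files.maxPartitionBytes", "1073741824")  # 1 GB

df = spark.read \
    .format("xml") \
    .options(**options) \
    .load("s3a://.../.../")

# With a larger maxPartitionBytes, small files may be packed together,
# so this may report far fewer than 20081 partitions.
print(df.rdd.getNumPartitions())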

Upvotes: 1

Michael Heil

Reputation: 18475

The Dataframe resulting from spark.read will always have as many partitions as there are files, because each file is read by a dedicated task.

If you need to run this process more often, I would rather consume those original 20000 files once and copy them into fewer files using coalesce or repartition, as sketched below. Then all subsequent reads of those files will result in a Dataframe with fewer partitions.
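A minimal sketch of such a one-time compaction job, assuming a hypothetical s3a://bucket/compacted/ output location and Parquet as the compacted format:

# One-time compaction: read the ~20000 small XML files once...
df = spark.read \
    .format("xml") \
    .options(**options) \
    .load("s3a://.../.../")

# ...and rewrite them as a small, fixed number of larger files.
# 20 and the output path are placeholders; pick a count that yields
# reasonably sized files.
df.repartition(20) \
    .write \
    .mode("overwrite") \
    .parquet("s3a://bucket/compacted/")

# Subsequent runs read the compacted copy and should get roughly
# as many partitions as there are compacted files.
compacted = spark.read.parquet("s3a://bucket/compacted/")
print(compacted.rdd.getNumPartitions())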

Upvotes: 1
