Reputation: 734
After I read XML files using Spark:
df = spark.read \
    .format("xml") \
    .options(**options) \
    .load("s3a://.../.../")
I checked the number of partitions with df.rdd.getNumPartitions() and got 20081.
How do I limit the number of partitions at the start so that I don't have to call coalesce() later? The problem with having so many partitions is that each partition creates one file during df.write, and writing 20081 very small new files to S3 each time this process runs is bad practice.
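Right now the only way I see is to coalesce right before writing, roughly like this (the output path and the target of 32 partitions are just placeholders for illustration):

# Workaround I'd like to avoid: shrink partitions just before the write.
df.coalesce(32) \
    .write \
    .mode("overwrite") \
    .parquet("s3a://my-bucket/output/")  # placeholder path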
Upvotes: 3
Views: 2420
Reputation: 6082
The number of partitions is calculated by DataSourceScanExec via a somewhat complex formula. To simplify, try increasing spark.sql.files.maxPartitionBytes; it defaults to 134217728 (128 MB). Make it larger and you will see the difference.
spark.conf.set('spark.sql.files.maxPartitionBytes', '1073741824') # 1 GB
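With that setting applied before the read, the scan planning should pack more of the small files into each partition. A sketch reusing the question's path and options (the 1 GB value is just an example, tune it for your data):

# Set the config first, then read; the file scan picks up the new limit.
spark.conf.set('spark.sql.files.maxPartitionBytes', '1073741824')  # 1 GB

df = spark.read \
    .format("xml") \
    .options(**options) \
    .load("s3a://.../.../")

print(df.rdd.getNumPartitions())  # should now be well below 20081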
Upvotes: 1
Reputation: 18475
The resulting DataFrame of spark.read will always have as many partitions as there are files, because each file is read by a dedicated task.
If you need to run this process repeatedly, I would rather consume those original ~20000 files once and copy them into fewer, larger files using coalesce or repartition (see the sketch below). Then all subsequent reads of those files will result in a DataFrame with fewer partitions.
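A rough sketch of that one-time compaction (the compacted path, the Parquet output format, and the target of 32 partitions are assumptions, not part of the original question):

compacted_path = "s3a://my-bucket/compacted/"  # assumed target location

df = spark.read \
    .format("xml") \
    .options(**options) \
    .load("s3a://.../.../")

# Rewrite the ~20000 small files into a fixed number of larger files, once.
df.repartition(32).write.mode("overwrite").parquet(compacted_path)

# Subsequent runs read the compacted copy and get far fewer partitions.
df_small = spark.read.parquet(compacted_path)
print(df_small.rdd.getNumPartitions())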
Upvotes: 1