Reputation: 3830
I have a data source that consists of a huge number of small files. I would like to save this data, partitioned by the column user_id, to another storage location:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is that I want another system to be able to delete only selected users' data upon request.
This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally down to one (the process will run each day, so having one output file per user per day would work well).
How do I achieve this with PySpark?
Upvotes: 0
Views: 954
Reputation: 5526
You can use repartition to ensure that each partition gets exactly one file:
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
This makes sure that exactly one file is created per partition; with coalesce, on the other hand, you can run into trouble when there is more than one partition.
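For the daily run described in the question, a minimal sketch might look like the following (the paths and the date-stamped output location are placeholders I made up, not from the original post):
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- substitute your real input and output locations.
input_path = "/data/events/json"
output_path = f"/data/by_user/{date.today().isoformat()}"

sdf = spark.read.json(input_path)

# repartition("user_id") hash-partitions the rows so that everything for a
# given user_id lands in one Spark partition; partitionBy then lays the
# output out as output_path/user_id=<id>/ with a single part file per user.
sdf.repartition("user_id").write.partitionBy("user_id").json(output_path)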
Upvotes: 1
Reputation: 4045
Just add coalesce with the number of files you want:
sdf.coalesce(1).write.partitionBy("user_id").json("...")
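As a rough end-to-end sketch (paths are placeholders; coalesce(1) means the whole dataset is written by a single task, which is fine for modest volumes but can be slow for large inputs):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- substitute your real locations.
sdf = spark.read.json("/data/events/json")

# coalesce(1) collapses the DataFrame into a single partition before the
# write, so each user_id directory gets at most one part file.
sdf.coalesce(1).write.partitionBy("user_id").json("/data/by_user")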
Upvotes: 1