Reputation: 3830
I have a data source that consists of a huge number of small files. I would like to save this data, partitioned by the column user_id, to another storage location:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is that I want another system to be able to delete only selected users' data upon request.
This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally down to one (the process will run each day, so having one output file per user per day would work well).
How do I achieve this with PySpark?
Upvotes: 0
Views: 954
Reputation: 5526
You can use repartition to ensure that each partition gets exactly one file:
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
This makes sure that exactly one file is created per partition; with coalesce, on the other hand, you can run into trouble when there is more than one partition.
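For the daily run described in the question, a minimal sketch might look like the following (the paths and the date-stamped output location are placeholders I made up, not from the original post):
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- substitute your real input and output locations.
input_path = "/data/events/json"
output_path = f"/data/by_user/{date.today().isoformat()}"

sdf = spark.read.json(input_path)

# repartition("user_id") hash-partitions the rows so that everything for a
# given user_id lands in one Spark partition; partitionBy then lays the
# output out as output_path/user_id=<id>/ with a single part file per user.
sdf.repartition("user_id").write.partitionBy("user_id").json(output_path)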
Upvotes: 1
Reputation: 4045
Just add coalesce with the number of files you want:
sdf.coalesce(1).write.partitionBy("user_id").json("...")
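As a rough end-to-end sketch (paths are placeholders; coalesce(1) means the whole dataset is written by a single task, which is fine for modest volumes but can be slow for large inputs):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- substitute your real locations.
sdf = spark.read.json("/data/events/json")

# coalesce(1) collapses the DataFrame into a single partition before the
# write, so each user_id directory gets at most one part file.
sdf.coalesce(1).write.partitionBy("user_id").json("/data/by_user")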
Upvotes: 1