casparjespersen

Reputation: 3830

Modifying the number of output files per write-partition with Spark

I have a data source that consists of a huge number of small files. I would like to save this data, partitioned by the column user_id, to another storage:

# Read the many small input files
sdf = spark.read.json("...")
# Write one output directory per user_id value
sdf.write.partitionBy("user_id").json("...")

The reason for this is that I want another system to be able to delete a selected user's data upon request.

This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally to just one (the process will run each day, so having one output file per user per day would work well).

How do I achieve this with PySpark?

Upvotes: 0

Views: 954

Answers (2)

Shubham Jain

Reputation: 5526

You can use repartition to ensure that each partition gets exactly one file:

sdf.repartition('user_id').write.partitionBy("user_id").json("...")

This will make sure one file is created per partition, whereas coalesce can cause trouble when there is more than one partition.
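
For reference, here is a minimal sketch of the full daily job. The paths and app name are hypothetical stand-ins for the elided "..." values in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-by-user").getOrCreate()  # hypothetical app name

# Read the many small input files (hypothetical path).
sdf = spark.read.json("s3://bucket/raw/")

# repartition("user_id") shuffles the data so that all rows for a given
# user_id end up in the same Spark partition; partitionBy("user_id") then
# writes exactly one file into each user_id=... output directory.
(sdf.repartition("user_id")
    .write
    .partitionBy("user_id")
    .mode("overwrite")
    .json("s3://bucket/by-user/"))  # hypothetical path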

Upvotes: 1

QuickSilver

Reputation: 4045

Just add coalesce with the number of files you want:

sdf.coalesce(1).write.partitionBy("user_id").json("...")
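
Note that coalesce(1) collapses the DataFrame to a single partition before the write, so one task writes all of the user_id=... directories; that is fine for a small daily batch but will not scale to large inputs. A quick sanity check of the partition count (a sketch, using the sdf DataFrame from the question):

# coalesce(1) leaves exactly one partition, so one file is written
# into each user_id=... directory, all by a single task.
out = sdf.coalesce(1)
print(out.rdd.getNumPartitions())  # -> 1
out.write.partitionBy("user_id").json("...")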

Upvotes: 1
