giri rajh

Reputation: 47

Write out spark df as single parquet file in databricks

I have a dataframe something like below:

Filename  col1  col2
file1     1     1
file1     1     1
file2     2     2
file2     2     2

I need to save this as parquet, partitioned by file name. When I use df.write.partitionBy("Filename").mode("overwrite").parquet(file_out_location), it creates two folders (one per partition), Filename=file1 and Filename=file2, each containing many part files.

How can I save it as a single file within each partition directory, e.g. Filename=file1.parquet and Filename=file2.parquet?
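For reference, a minimal sketch of the setup described above (assuming a PySpark notebook where spark is already defined; the path assigned to file_out_location is a placeholder):

# Minimal reproduction of the dataframe from the example above
df = spark.createDataFrame(
    [("file1", 1, 1), ("file1", 1, 1), ("file2", 2, 2), ("file2", 2, 2)],
    ["Filename", "col1", "col2"],
)

# This write creates Filename=file1/ and Filename=file2/ directories,
# each containing multiple part files
file_out_location = "/tmp/parquet-out"  # placeholder path
df.write.partitionBy("Filename").mode("overwrite").parquet(file_out_location)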

Upvotes: 0

Views: 2800

Answers (1)

Ronak Jain

Reputation: 3348

This would work:

from pyspark.sql import functions as F

# Count the distinct partition values so each one maps to exactly one
# in-memory partition (and therefore one output file per directory)
row = df.selectExpr("cast(count(DISTINCT FileName) as int) as cnt").head()

df \
  .repartition(row["cnt"], F.col("FileName")) \
  .write \
  .partitionBy("FileName") \
  .parquet("output-folder-path/")

Essentially you need to repartition the in-memory dataframe on the same column(s) you intend to use in partitionBy(). Without passing row["cnt"] as above, it will default to spark.sql.shuffle.partitions partitions.
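For reference, that default can be inspected from the Spark conf (a quick check, assuming the usual notebook spark session; Spark's out-of-the-box value is 200):

# The fallback partition count used when repartition() gets no explicit number
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden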

The above will produce one file per partition based on the partition column.
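As a quick sanity check in Databricks, a sketch like the following could confirm there is one part file per partition directory (dbutils is available in notebooks; "output-folder-path/" is the placeholder path from the code above):

# Hypothetical check: list each partition directory and count its part files
for entry in dbutils.fs.ls("output-folder-path/"):
    if "=" in entry.name:  # partition directories look like FileName=file1/
        parts = [f for f in dbutils.fs.ls(entry.path) if f.name.startswith("part-")]
        print(entry.name, "->", len(parts), "part file(s)")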

Without repartition: [output screenshot]

With repartition: [output screenshot]

Upvotes: 2
