Reputation: 47
I have a dataframe something like below:
| Filename | col1 | col2 |
|---|---|---|
| file1 | 1 | 1 |
| file1 | 1 | 1 |
| file2 | 2 | 2 |
| file2 | 2 | 2 |
I need to save this as parquet, partitioned by file name. When I use

```python
df.write.partitionBy("Filename").mode("overwrite").parquet(file_out_location)
```

it creates two folders (one per partition value), `Filename=file1` and `Filename=file2`, with many part files inside each.

How can I save it as a single file within each partition directory, e.g. `Filename=file1.parquet` and `Filename=file2.parquet`?
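For reference, a minimal repro of my setup (the SparkSession creation and output path are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("file1", 1, 1), ("file1", 1, 1), ("file2", 2, 2), ("file2", 2, 2)],
    ["Filename", "col1", "col2"],
)

file_out_location = "/tmp/parquet_out"  # placeholder output path

# Each Filename=... directory ends up with one part file per
# in-memory partition that holds rows for that Filename value
df.write.partitionBy("Filename").mode("overwrite").parquet(file_out_location)
```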
Upvotes: 0
Views: 2800
Reputation: 3348
This would work:
```python
from pyspark.sql import functions as F

# Count the distinct partition values so we can repartition to exactly that many
row = df.selectExpr("cast(count(DISTINCT Filename) as int) as cnt").head()

df \
    .repartition(row["cnt"], F.col("Filename")) \
    .write \
    .partitionBy("Filename") \
    .parquet("output-folder-path/")
```
Essentially, you need to repartition the in-memory DataFrame on the same column(s) you intend to use in `partitionBy()`. Without passing `row["cnt"]` as above, `repartition` falls back to `spark.sql.shuffle.partitions` (200 unless configured otherwise), leaving most of those partitions empty here.

Because repartitioning by the column sends every row with the same `Filename` to the same in-memory partition, the write above produces exactly one file inside each `Filename=...` directory.
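If you want to sanity-check the result, something like this works when the output lands on a local filesystem (the path is the same placeholder used above):

```python
import os

out = "output-folder-path/"

# Each Filename=... directory should now contain exactly one parquet file
for d in sorted(os.listdir(out)):
    if d.startswith("Filename="):
        parts = [f for f in os.listdir(os.path.join(out, d)) if f.endswith(".parquet")]
        print(d, len(parts))  # expect 1 for every directory
```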
Upvotes: 2