Reputation: 73
I am writing partitioned output using the script below:
df_input_table
  .write
  .format("csv")
  .partitionBy("date", "region")
  .option("delimiter", "\t")
  .mode("overwrite")
  .save("s3://mybucket/myfolder/")
However, this results in one file under each partition. I would like to have multiple similarly sized files under each partition. How can I achieve that? I am on Spark 2.2.
I tried adding an additional key to the repartition, like df_input_table.repartition($"region", $"date", $"region"). However, that leads to files of different sizes.
I would like to stick with Spark (instead of Hive).
Upvotes: 1
Views: 2249
Reputation: 5881
Repartition is pretty expensive because it shuffles the data across the network. Limiting the maximum number of records written per file is highly desirable, since it avoids generating huge files. As of Spark 2.2, there are two ways to set this limit.
// Method 1: specify the limit as an option on the DataFrameWriter API.
df.write.option("maxRecordsPerFile", 1000)
  .mode("overwrite").parquet(outputDirectory)

// Method 2: specify the limit via the session-scoped SQLConf configuration.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
df.write.mode("overwrite").parquet(outputDirectory)
For example: if your DataFrame has 10,000 records and you set maxRecordsPerFile = 1000, Spark will create 10 files with the same number of rows.
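For the write in the question, a minimal sketch combining this option with the original partitioned CSV output (assuming the df_input_table name and S3 path from the question; the 1000-record limit is only illustrative):

df_input_table
  .write
  .format("csv")
  .partitionBy("date", "region")
  .option("delimiter", "\t")
  .option("maxRecordsPerFile", 1000)  // cap the number of records per output file
  .mode("overwrite")
  .save("s3://mybucket/myfolder/")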
Upvotes: 1
Reputation: 69
df_input_table
  .orderBy("date", "region")
  .repartition(10)
  .write
  .format("csv")
  .option("delimiter", "\t")
  .mode("overwrite")
  .save("s3://mybucket/myfolder/")
You will get 10 files of roughly similar size.
Upvotes: 0
Reputation: 1642
You cannot control the size of output files in Spark.
repartition does not guarantee file sizes; it only creates files based on keys. Say you have a file that contains 6 rows, with keys A (5 rows) and B (1 row), and you repartition into 2 partitions: it will create 2 files, one with 5 rows and the other with only 1 row.
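To make the skew concrete, here is a small hypothetical Scala sketch of that 6-row example (the key column name and the spark_partition_id check are only for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder.appName("skew-demo").getOrCreate()
import spark.implicits._

// Keys are skewed: A appears 5 times, B once.
val df = Seq("A", "A", "A", "A", "A", "B").toDF("key")

// Hash-partitioning by key puts every "A" row in the same partition,
// so the row counts per partition (and per output file) are uneven
// (B may even land in the same partition, leaving the other empty).
df.repartition(2, $"key")
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .show()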
You can try this solution instead: How do you control the size of the output file?
Upvotes: 0