hiits100rav

Reputation: 73

Split files under partitions in Spark

I am writing partitioned output using the script below.

    .write
    .format("csv")
    .partitionBy("date","region")
    .option("delimiter", "\t")
    .mode("overwrite")
    .save("s3://mybucket/myfolder/")

However, this results in one file under each partition. I would like to have multiple similar-sized files under each partition. How can I achieve this? I am on Spark 2.2.

I tried adding an extra key to the repartition call, like df_input_table.repartition($"region",$"date",$"region"). However, that leads to files of different sizes.

I would like to stick to Spark (instead of Hive).

Upvotes: 1

Views: 2249

Answers (3)

Kishore

Reputation: 5881

Repartitioning is fairly expensive because it shuffles the data across the network. Limiting the maximum number of records written per file is often a better fit, and it also avoids generating huge files. Spark 2.2 and later provide two ways to set this limit.

    // Method 1: specify the limit as an option on the DataFrameWriter API.
    df.write.option("maxRecordsPerFile", 1000)
      .mode("overwrite").parquet(outputDirectory)
    // Method 2: specify the limit via the session-scoped SQLConf configuration.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
    df.write.mode("overwrite").parquet(outputDirectory)

For example, if your DataFrame has 10,000 records in a single write task and you set maxRecordsPerFile = 1000, Spark will create 10 files with 1,000 rows each.
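Applied to the CSV write from the question, this could look roughly like the sketch below. The object and method names (CappedWrite, writeCapped, expectedFiles) are made up for illustration, and the file-count helper assumes that the cap is applied separately by each write task within each output directory, so the real number of files also depends on how the rows are spread across tasks.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    object CappedWrite {
      // Files one task is expected to write for `rows` rows in one output
      // directory: full files of `maxRows` rows plus at most one smaller file.
      def expectedFiles(rows: Long, maxRows: Long): Long =
        (rows + maxRows - 1) / maxRows

      // Hypothetical helper: same write as in the question, with a row cap per file.
      def writeCapped(df: DataFrame, path: String, maxRows: Long): Unit =
        df.write
          .format("csv")
          .partitionBy("date", "region")
          .option("delimiter", "\t")
          .option("maxRecordsPerFile", maxRows) // start a new file once the cap is reached
          .mode(SaveMode.Overwrite)
          .save(path)
    }

Note that the cap counts rows, not bytes, so files come out similar in size only when rows are of similar width.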

Upvotes: 1

annld

Reputation: 69

    .orderBy("date","region")
    .repartition(10)
    .write
    .format("csv")
    .option("delimiter", "\t")
    .mode("overwrite")
    .save("s3://mybucket/myfolder/")

You will get 10 files of roughly similar size.
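The snippet above drops partitionBy, so the date/region directory layout from the question is lost. One possible way to keep that layout and still get several files per directory is to repartition on the partition columns plus a random "salt" column. This is only a sketch of that idea, not something from the answer itself: the names SaltedRepartition, spreadWithinPartitions and the column name salt are invented here, and hash collisions can still put two salt values in the same task, so you may get fewer than filesPerDir files in some directories.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, floor, rand}

    object SaltedRepartition {
      // Hypothetical helper: spreads the rows of each (date, region) pair over
      // up to `filesPerDir` shuffle partitions by adding a random salt column.
      def spreadWithinPartitions(df: DataFrame, filesPerDir: Int): DataFrame =
        df.withColumn("salt", floor(rand() * filesPerDir)) // random bucket 0 until filesPerDir
          .repartition(col("date"), col("region"), col("salt"))
          .drop("salt") // the salt is only needed for the shuffle, not in the output
    }

    // Possible usage, with df_input_table as in the question:
    //   SaltedRepartition.spreadWithinPartitions(df_input_table, 10)
    //     .write.format("csv").partitionBy("date", "region")
    //     .option("delimiter", "\t").mode("overwrite")
    //     .save("s3://mybucket/myfolder/")

Because the salt is random, rows of one date/region pair are split roughly evenly, which is what makes the resulting files similar in size. This still costs a full shuffle.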

Upvotes: 0

Strick

Reputation: 1642

You cannot directly control the size of output files in Spark.

repartition does not guarantee file sizes; it only distributes rows based on the keys. Say you have a file that contains 6 rows with keys A (5 rows) and B (1 row), and you repartition into 2 partitions. It will create 2 files, one with 5 rows and the other with only 1 row.

You can try this solution instead: "How do you control the size of the output file?"
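The 6-row example can be reproduced with a small plain-Scala simulation. This is a simplified stand-in, assuming partition id = key hash modulo the number of partitions; Spark uses its own hash function, so the actual partition ids will differ, and two keys can also land in the same partition. Only the skew is the point. The names SkewDemo and partitionSizes are made up for this sketch.

    object SkewDemo {
      // Counts how many rows end up in each partition when rows are
      // assigned by hashing their key (simplified: String.hashCode).
      def partitionSizes(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
        keys
          .groupBy(k => Math.floorMod(k.hashCode, numPartitions))
          .map { case (pid, ks) => pid -> ks.size }

      def main(args: Array[String]): Unit = {
        val keys = Seq.fill(5)("A") :+ "B" // 5 rows with key A, 1 row with key B
        // All rows sharing a key go to the same partition, so sizes are uneven.
        println(partitionSizes(keys, 2))
      }
    }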

Upvotes: 0
