Reputation: 2060
I have implemented a Spark Streaming job which has been streaming the events it receives into HDFS for the past 6 months.
It creates many small files in HDFS, and I would like each file to be 128 MB (the HDFS block size).
If I use append mode, all the data is written to a single Parquet file instead.
How do I configure Spark to create a new Parquet file in HDFS for every 128 MB of data?
Upvotes: 2
Views: 1481
Reputation: 512
Spark writes one part file per partition of the object being written, which can be very inefficient when there are many small partitions. To reduce the total number of part files, try this: it estimates the total byte size of the DataFrame and repartitions it to the optimal number of 128 MB partitions, plus one.
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of the DataFrame in bytes
val estimatedSize: Long = SizeEstimator.estimate(inputDF.rdd)

// Find the appropriate number of partitions, targeting ~128 MB (134217728 bytes) each
val numPartitions: Long = (estimatedSize / 134217728L) + 1

// Repartition so each part file comes out at roughly 128 MB when written
val outputDF = inputDF.repartition(numPartitions.toInt)
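For completeness, a minimal sketch of writing the repartitioned DataFrame out as Parquet; the output path below is just a placeholder and not part of the original answer:

// Write one ~128 MB Parquet part file per partition; "/path/to/output" is a placeholder path
outputDF.write
  .mode("overwrite")
  .parquet("/path/to/output")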
Upvotes: 2