ilovetolearn

Reputation: 2060

Spark Streaming creating many small files

I have implemented a Spark Streaming job that has been streaming received events into HDFS for the past 6 months.

It is creating many small files in HDFS, and I would like each file to be 128 MB, the HDFS block size.

If I were to use append mode, all the data would be written to a single Parquet file instead.

How do I configure Spark to create a new Parquet file in HDFS for every 128 MB of data?

Upvotes: 2

Views: 1481

Answers (1)

afeldman

Reputation: 512

Spark writes one file per partition of the DataFrame at write time, which can be very inefficient. To reduce the total number of part files, try the snippet below: it estimates the total byte size of the DataFrame and repartitions it so each partition holds roughly 128 MB, plus one partition to cover the remainder.

import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of the DataFrame in bytes
val estimatedSize: Long = SizeEstimator.estimate(inputDF.rdd)
// Find the appropriate number of partitions (128 MB = 134217728 bytes), +1 for the remainder
val numPartitions: Long = (estimatedSize / 134217728L) + 1
// Repartition the DataFrame to that many partitions before writing
val outputDF = inputDF.repartition(numPartitions.toInt)
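
For completeness, here is a minimal sketch of the write step, assuming outputDF from the snippet above and a hypothetical HDFS path (not from the original post). Writing the repartitioned DataFrame produces roughly one ~128 MB part file per partition:

// Hypothetical usage: write the repartitioned DataFrame as Parquet,
// yielding approximately one ~128 MB part file per partition.
outputDF.write
  .mode("append")                  // keep existing files and add new ones
  .parquet("hdfs:///data/events")  // example output path, adjust to your setup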

Upvotes: 2
