rogue-one

Reputation: 11577

Spark Streaming: avoid small files in HDFS

I have a Spark Streaming application that writes its output to HDFS.

What precautions and strategies can I use to keep this process from generating too many small files and putting memory pressure on the HDFS NameNode? Does Apache Spark provide any built-in solution for avoiding small files in HDFS?

Upvotes: 2

Views: 1489

Answers (4)

Jegan

Reputation: 1751

I know this question is old, but this may be useful to someone in the future.

Another option is to use coalesce with a smaller number of partitions. coalesce merges partitions into fewer, larger ones. This can increase the processing time of each streaming batch, because fewer partitions are written in parallel, but it reduces the number of output files.

Because this reduces parallelism, too few partitions can cause problems for the streaming job. You will have to test different partition counts for coalesce to find the value that works best in your case.
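As a starting point for that tuning, here is a minimal sketch of deriving a partition count from a batch size estimate. The names `partitions_for` and `write_batch` are hypothetical, the 128 MiB target is an assumption (a common HDFS block size), and `df` is assumed to be a PySpark DataFrame; how you estimate the batch size is up to you.

```python
import math

# Assumed target: one output file per ~128 MiB of data.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def partitions_for(estimated_bytes: int,
                   target_file_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Return how many partitions give files of roughly target_file_bytes."""
    if estimated_bytes <= 0:
        return 1
    return max(1, math.ceil(estimated_bytes / target_file_bytes))


def write_batch(df, path: str, estimated_bytes: int) -> int:
    """Coalesce a batch before appending it to `path` (hypothetical helper).

    `df` is assumed to be a pyspark.sql.DataFrame. coalesce() only merges
    partitions, so the real count never exceeds the current one.
    """
    n = partitions_for(estimated_bytes)
    df.coalesce(n).write.mode("append").parquet(path)
    return n
```

Treat the computed value as an upper bound to experiment with, not a recommendation: a very small count can still slow the batch down as described above.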

Upvotes: 0

Sandeep Das

Reputation: 1040

You can reduce the number of part files. By default, Spark writes 200 part files after a shuffle (the default value of spark.sql.shuffle.partitions), and you can lower that number.

Upvotes: -1

Nastasia

Reputation: 657

Another solution is to run a separate Spark application that re-aggregates the small files every hour, day, week, etc.
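A rough sketch of the planning step for such a compaction job, assuming you already have a mapping of file path to size (for example from a directory listing). `plan_compaction` is a hypothetical helper and the 128 MiB target is an assumption; it only decides which files to merge and does not touch HDFS.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int],
                    target_bytes: int = 128 * 1024 * 1024) -> List[List[str]]:
    """Group small files into batches of at most about target_bytes each.

    Files already at or above the target are skipped, and groups that
    contain a single file are dropped because there is nothing to merge.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path in sorted(file_sizes):
        size = file_sizes[path]
        if size >= target_bytes:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]
```

The periodic job could then read each group, coalesce it to a single partition, write the result, and remove the originals once the write succeeds. How to make that swap safe for readers depends on your setup.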

Upvotes: 2

Vladislav Varslavans

Reputation: 2934

No, Spark does not provide any such solution.

What you can do:

  1. Increase the batch interval. This does not guarantee anything, but it makes larger files more likely. The tradeoff is that the stream will have higher latency.
  2. Manage it manually. For example, on each batch you could calculate the size of the RDD and accumulate RDDs until they satisfy your size requirement. Then you union the RDDs and write them to disk. This increases latency unpredictably, but guarantees efficient space usage.
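The bookkeeping for option 2 can be sketched as follows. `BatchBuffer` is a hypothetical helper, not a Spark class, and the size of each batch is assumed to be supplied by the caller, since how to measure it is left open above.

```python
class BatchBuffer:
    """Holds micro-batches until their combined size reaches a threshold."""

    def __init__(self, min_bytes: int):
        self.min_bytes = min_bytes
        self._batches = []
        self._size = 0

    def add(self, batch, size_bytes: int):
        """Buffer a batch.

        Returns the list of buffered batches once their total size reaches
        min_bytes (and resets the buffer), otherwise returns None.
        """
        self._batches.append(batch)
        self._size += size_bytes
        if self._size < self.min_bytes:
            return None
        ready, self._batches, self._size = self._batches, [], 0
        return ready
```

Inside `foreachRDD` you would call `buffer.add(rdd, size)` and, when it returns a list, write `sc.union(ready)` to HDFS. Keep in mind that the buffered RDDs probably need to be persisted so they can still be computed at write time, and that anything held only in the buffer is not yet on disk.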

Upvotes: 2
