Reputation: 11577
I have a Spark Streaming application that writes its output to HDFS.
What precautions and strategies can I take to ensure that this process does not generate too many small files and create memory pressure on the HDFS NameNode? Does Apache Spark provide any pre-built solutions to avoid small files in HDFS?
Upvotes: 2
Views: 1489
Reputation: 1751
I know this question is old, but it may be useful for someone in the future.
Another option is to use coalesce with a smaller number of partitions. coalesce merges partitions together and creates larger ones. This can increase the processing time of each streaming batch, because fewer partitions are doing the write, but it will help in reducing the number of files.
Since this reduces parallelism, having too few partitions can cause issues for the streaming job. You will have to test different values for coalesce to find which one works best in your case; a sketch follows below.
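A minimal sketch of the idea, assuming a DStream over a socket source; the source, the output path, the batch interval, and the partition count of 4 are all illustrative assumptions, not part of the original answer:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CoalescedStreamWriter {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("coalesced-stream-writer")
        // One micro-batch per minute; larger batch intervals also mean fewer files.
        val ssc = new StreamingContext(conf, Seconds(60))

        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { (rdd, batchTime) =>
          if (!rdd.isEmpty()) {
            // coalesce merges existing partitions without a shuffle, so each
            // micro-batch writes at most 4 part files instead of one per task.
            rdd.coalesce(4)
              .saveAsTextFile(s"hdfs:///tmp/stream-out/batch-${batchTime.milliseconds}")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

coalesce is preferred over repartition here because it merges partitions in place rather than triggering a full shuffle on every batch.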
Upvotes: 0
Reputation: 1040
You can reduce the number of part files. By default, Spark SQL writes output in 200 part files after a shuffle (the default value of spark.sql.shuffle.partitions); lowering that setting decreases the number of part files, as in the sketch below.
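A minimal sketch, assuming the output goes through the DataFrame API; the paths, the column name, and the value 20 are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    object FewerPartFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("fewer-part-files")
          // Down from the default of 200 shuffle partitions.
          .config("spark.sql.shuffle.partitions", "20")
          .getOrCreate()

        val df = spark.read.json("hdfs:///tmp/input")

        // Any shuffle (groupBy, join, ...) now produces 20 partitions, so
        // this write emits 20 part files instead of 200.
        df.groupBy("key").count()
          .write.parquet("hdfs:///tmp/output")
      }
    }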
Upvotes: -1
Reputation: 657
Another solution is to run a separate Spark application that periodically reaggregates the small files (every hour, day, week, etc.), along the lines of the sketch below.
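A minimal sketch of such a compaction job, assuming text files under a single directory; the paths and the target file count of 8 are assumptions, and the job itself would be scheduled externally with cron, Oozie, or similar:

    import org.apache.spark.sql.SparkSession

    object SmallFileCompactor {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("small-file-compactor")
          .getOrCreate()

        // Read every small file the streaming job has produced so far.
        val df = spark.read.text("hdfs:///tmp/stream-out/*")

        // Rewrite them as a handful of large files. repartition shuffles the
        // data, which balances file sizes better than coalesce would for a
        // one-off batch job.
        df.repartition(8)
          .write.text("hdfs:///tmp/stream-out-compacted")

        // After verifying the compacted copy, the original directory can be
        // removed (not shown here).
      }
    }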
Upvotes: 2
Reputation: 2934
No. Spark does not provide any such built-in solution.
What you can do:
Upvotes: 2