Reputation: 1
Our data is loaded into HDFS daily, partitioned by a date column. The issue is that each partition contains small files, each less than 50 MB, so reading the data from all these partitions to load it into the next table takes hours. How can we address this issue?
Upvotes: 0
Views: 407
Reputation: 29185
I'd suggest running an end-of-day job that coalesces/combines the small files into significantly larger ones before Spark reads them for processing.
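As a rough sketch of such a daily compaction job (the paths, the date argument, and the target file count below are assumptions for illustration, not your actual layout):

    import org.apache.spark.sql.SparkSession

    object CompactDailyPartition {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("compact-daily-partition")
          .getOrCreate()

        // Hypothetical layout: one directory per date partition; adjust to yours.
        val date       = args(0)                                   // e.g. "2020-07-01"
        val inputPath  = s"hdfs:///data/events/date=$date"         // many files < 50 MB
        val outputPath = s"hdfs:///data/events_compacted/date=$date"

        // Read the whole day's partition and rewrite it as a few large files.
        // Pick the coalesce count so each output file lands around 128-256 MB.
        spark.read.parquet(inputPath)
          .coalesce(4)
          .write
          .mode("overwrite")
          .parquet(outputPath)

        spark.stop()
      }
    }

coalesce avoids a full shuffle; use repartition instead if you need evenly sized output files for a large day of data.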
For further reading on these problems, see the Cloudera blog post Partition Management in Hadoop, where several techniques for addressing them are discussed.
Select the technique from that post that best matches your requirements. Hope this helps!
Another good option for this typical use case is open-source Delta Lake; if you are using Databricks, go for their Delta Lake to get a richer set of features...
Example Maven coordinates:
    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.11</artifactId>
        <version>0.6.1</version>
    </dependency>
Using Delta Lake you can insert/update/delete the data as you want, which reduces the maintenance steps...
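For instance, a small sketch of the DeltaTable DML API (the table path and the conditions are made up for illustration; spark is an active SparkSession, e.g. in spark-shell):

    import io.delta.tables.DeltaTable

    // Hypothetical Delta table location.
    val deltaTable = DeltaTable.forPath(spark, "hdfs:///data/events_delta")

    // Fix a mistyped value in place instead of rewriting whole partitions.
    deltaTable.updateExpr(
      "eventType = 'clck'",
      Map("eventType" -> "'click'"))

    // Remove old rows directly.
    deltaTable.delete("date < '2019-01-01'")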
See also: Compacting Small Files in Delta Lakes
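A minimal compaction sketch along those lines, assuming an existing Delta table at a hypothetical path: repartitioning rewrites the many small files into a few larger ones, and dataChange = "false" records that only the file layout changed, not the data.

    // Rewrite the table's small files into numFiles larger files.
    val path = "hdfs:///data/events_delta"   // hypothetical table location
    val numFiles = 16

    spark.read
      .format("delta")
      .load(path)
      .repartition(numFiles)
      .write
      .option("dataChange", "false")
      .format("delta")
      .mode("overwrite")
      .save(path)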
Upvotes: 0