Reputation: 323
I am trying to use clustering configurations on a Hudi COW table to keep only a single file in each partition folder when the total partition data size is less than 128 MB, but clustering does not seem to work with bulk_insert as expected. We have a few tables in the TB range (20 TB, 7 TB, 3 TB) with a partition count of about 77,000. Please find below the options we tried. We are running our PySpark job on EMR Serverless 6.8.0.
Hudi write mode as "bulk_insert" mode with below clustering configs.
hoodie.clustering.inline=true
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.sort.columns=columnA,ColumnB
hoodie.clustering.inline.max.commits=4
Result: the output partition has 26 files of around 800 KB each.
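For reference, this is roughly how we pass these options from PySpark (the table name, key fields, and S3 path below are placeholders, not our actual values):

hudi_options = {
    "hoodie.table.name": "my_table",                                  # placeholder
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "id",                  # placeholder
    "hoodie.datasource.write.partitionpath.field": "partition_col",   # placeholder
    "hoodie.datasource.write.precombine.field": "ts",                 # placeholder
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",
    "hoodie.clustering.plan.strategy.small.file.limit": "134217728",
    "hoodie.clustering.plan.strategy.sort.columns": "columnA,ColumnB",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path/my_table")  # placeholder path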
Hudi write mode as "bulk_insert" and removed all the clustering configurations.
Result: Output partition has 26 files of size around 800KB/file
Hudi write mode as "insert" mode with below clustering configs.
hoodie.clustering.inline=true
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.sort.columns=columnA,ColumnB
hoodie.clustering.inline.max.commits=4
Result: the output partition has only 1 file, of size 11 MB.
Hudi write mode "insert" with all clustering configurations removed.
Result: the output partition has only 1 file, of size 11 MB.
We tried the below Hudi configurations as well, but still got the same results as above:
hoodie.parquet.max.file.size=125829120
hoodie.parquet.small.file.limit=104857600
hoodie.clustering.plan.strategy.small.file.limit=600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=4
hoodie.clustering.plan.strategy.max.bytes.per.group=1073741824
It seems clustering is not applied in bulk_insert mode but is applied in insert mode. Can anyone tell me whether this is the right approach, or am I doing something wrong here? Your help is highly appreciated.
Upvotes: 1
Views: 443
Reputation: 1152
Did you wait for 4 commits to let the clustering service be triggered, as per hoodie.clustering.inline.max.commits=4?
One common misunderstanding: in Hudi you first write the data without clustering, and clustering is then triggered based on rules to rewrite the files.
So in your case the clustering likely never happened. The reason insert produces larger files is that this operation is designed for that, while bulk_insert just uses vanilla Spark mechanisms to write; you can still use coalesce to control the number of files produced.
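A minimal sketch of that, assuming you write through the Spark datasource (the coalesce value and the path are placeholders to tune for your data volume):

(df.coalesce(200)                        # placeholder value; coalesce reduces the number of DataFrame partitions before the write
   .write.format("hudi")
   .option("hoodie.datasource.write.operation", "bulk_insert")
   .mode("append")
   .save("s3://bucket/path/my_table"))   # placeholder path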
If you'd like to use bulk_insert and apply custom transformations such as clustering or sorting, you can plug your own logic into a custom partitioner, see https://hudi.apache.org/docs/configurations/#hoodiebulkinsertuserdefinedpartitionerclass. That way you would write the parquet files right in the first place, without having to rely on a table service to rewrite the files afterwards.
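As a sketch of the wiring only (the class name below is a hypothetical placeholder; the partitioner itself has to be a JVM class on the Spark classpath implementing Hudi's bulk insert partitioner interface):

hudi_options = {
    "hoodie.datasource.write.operation": "bulk_insert",
    # Hypothetical class name -- replace with your own implementation.
    "hoodie.bulkinsert.user.defined.partitioner.class": "com.example.MyBulkInsertPartitioner",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path/my_table")  # placeholder path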
Upvotes: 0