Reputation: 37
What is wrong in this config?
Partition keys are not working in Hudi, and every record in the Hudi dataset gets updated during an upsert, so I can't extract the delta from the tables.
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'hash_value',
    'hoodie.datasource.write.recordkey.field': 'hash_value',
    'hoodie.datasource.hive_sync.partition_fields': 'year,month,day',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.ComplexKeyGenerator',
    'hoodie.table.name': 'hudi_account',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': 'hudi_db',
    'hoodie.datasource.hive_sync.table': 'hudi_account',
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://' + args['curated_bucket'] + '/stage_e/hudi_db/hudi_account'
}
My use case is to do the upsert logic with Hudi and let Hudi handle the partitioning. The upsert partially works, but it touches the entire dataset: if I have 10k records in the raw bucket and upsert 1k of them, the Hudi commit time gets updated for all 10k records.
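For context, the question does not show the write call. Below is a minimal, hypothetical sketch of how such a config is typically applied with the plain Spark DataFrame writer; inputDf is an assumed DataFrame read from the raw bucket (the 'className' key hints the original job actually writes through AWS Glue, where commonConfig would be passed as connection options instead):

# Hypothetical write call, not part of the original question.
(inputDf.write
    .format('org.apache.hudi')                               # 'className' is ignored by the plain writer
    .options(**commonConfig)
    .option('hoodie.datasource.write.operation', 'upsert')   # upsert incoming rows by record key
    .mode('append')
    .save(commonConfig['path']))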
Upvotes: 1
Views: 3582
Reputation: 21
Did your partition keys change? By default Hudi doesn't use a global index, but a per-partition one. I was having problems similar to yours; once I enabled a global index, it worked. Try adding these settings:
"hoodie.index.type": "GLOBAL_BLOOM", # This is required if we want to ensure we upsert a record, even if the partition changes
"hoodie.bloom.index.update.partition.path": "true", # This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)
I found the answer on this blog: https://dacort.dev/posts/updating-partition-values-with-apache-hudi/
Here you can see more information about hudi indexes: https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
Upvotes: 2