Reputation: 731
When writing data to Apache Hudi on Amazon EMR using PySpark, we can specify the table name in the write configuration. For example:
hudiOptions = {
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'tableName',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
# Write a DataFrame as a Hudi dataset
inputDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3://DOC-EXAMPLE-BUCKET/myhudidataset/')
I see the option hoodie.table.name. In my case, I have written data from multiple tables into the same path, but with different table names. However, when I read a DataFrame from the base path like:
snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load('s3://DOC-EXAMPLE-BUCKET/myhudidataset' + '/*/*')
snapshotQueryDF.show()
I get results from all the tables. Is there any way to filter the data for only a particular table name? Searching through the Apache Hudi configuration options, I don't see anything that helps me.
Upvotes: 0
Views: 472
Reputation: 1
You can try using spark.read.table. It loads the table's metadata from the metastore, as long as the table is registered there. Your code would look like this:
snapshotQueryDF = spark.read.table('<database>.<table>')
snapshotQueryDF.show()
Upvotes: 0