Anurag A S

Reputation: 731

Querying Apache Hudi using PySpark on EMR by table name

While writing data to Apache Hudi on EMR using PySpark, we can specify configuration options that save the dataset under a table name.

For example:

hudiOptions = {
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'tableName',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}

# Write a DataFrame as a Hudi dataset
inputDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3://DOC-EXAMPLE-BUCKET/myhudidataset/')

I see the option hoodie.table.name. In my case, I have written data from multiple tables to the same base path, each with a different table name.

However, when I query a dataframe from the base path like:

snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load('s3://DOC-EXAMPLE-BUCKET/myhudidataset/*/*')
    
snapshotQueryDF.show()

I get results from all the tables. Is there any way to filter the data for only a particular table name?

Searching through the Apache Hudi configuration reference, I don't see anything that helps.

Upvotes: 0

Views: 472

Answers (1)

You can try using spark.read.table. It resolves the table through the metastore by name, as long as the table is registered there (which Hudi's Hive sync, enabled in your write options, should do for you).

Your code would look like this:

snapshotQueryDF = spark.read.table('<database>.<table>')
    
snapshotQueryDF.show()
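Since your write options set hoodie.datasource.hive_sync.enable to 'true' with a distinct hoodie.datasource.hive_sync.table per table, each table should already have its own entry in the Hive/Glue metastore. An equivalent approach is to query that entry with Spark SQL. A minimal sketch, assuming the synced database is named 'default' (your database name may differ); the Spark calls are commented out because they need a live SparkSession on the cluster:

```python
# Hypothetical names -- substitute your own metastore database and the
# value you passed as hoodie.datasource.hive_sync.table.
database = 'default'
table = 'tableName'

# Query one specific Hudi table by name instead of scanning the base path.
query = f"SELECT * FROM {database}.{table}"

# On EMR, with an active SparkSession called `spark`, this would run as:
# snapshotQueryDF = spark.sql(query)
# snapshotQueryDF.show()
print(query)
```

Either way, Spark reads only the files belonging to that one Hudi table, so rows from the other tables written under the same base path never appear in the result.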

Upvotes: 0
