Reputation: 11
I have a GCP Postgres instance that serves as an external Hive metastore for a Dataproc cluster. I would like to be able to use this metastore for Dataproc serverless jobs as well. By experimenting with serverless and following the documentation, I am already able to:
I thought the Spark property "spark.hadoop.hive.metastore.uris" would allow serverless Spark jobs to connect to the Thrift server used by the Dataproc cluster, but it does not even seem to attempt the connection and instead errors with:
Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
The non-serverless Dataproc Spark jobs, which connect successfully, log:
INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-master-node:9083
The jobs are initiated via Airflow with a custom class extending DataprocCreateBatchOperator, using a batch configuration like so (a minimal operator sketch follows the configuration):
{
  "spark_batch": {
    "jar_file_uris": [
      "gs://bucket/path/to/jarFileObject.jar"
    ],
    "main_class": "com.package.MainClass",
    "args": [
      "--args=for",
      "--spark=job"
    ]
  },
  "runtime_config": {
    "version": "1.1",
    "properties": {
      "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083",
      "spark.sql.warehouse.dir": "gs://bucket/warehouse/dir",
      "spark.hadoop.metastore.catalog.default": "hive_metastore"
    },
    "container_image": "gcr.io/PROJECT_ID/path/image:latest"
  },
  "environment_config": {
    "execution_config": {
      "service_account": "service_account",
      "subnetwork_uri": "subnetwork_uri"
    },
    "peripherals_config": {
      "spark_history_server_config": {
        "dataproc_cluster": "projects/PROJECT_ID/regions/REGION/clusters/cluster-name"
      }
    }
  },
  "labels": {
    "job_name": "job_name"
  }
}
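For reference, here is a minimal sketch of how such a batch could be submitted with the stock DataprocCreateBatchOperator rather than the custom subclass; the DAG id, PROJECT_ID, REGION, batch_id, and the trimmed-down configuration are placeholders, not the actual setup:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Trimmed-down version of the batch configuration above; bucket paths and
# property values are the placeholders from the question.
batch_config = {
    "spark_batch": {
        "jar_file_uris": ["gs://bucket/path/to/jarFileObject.jar"],
        "main_class": "com.package.MainClass",
        "args": ["--args=for", "--spark=job"],
    },
    "runtime_config": {
        "version": "1.1",
        "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083",
        },
    },
}

with DAG(
    dag_id="dataproc_serverless_batch",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_batch = DataprocCreateBatchOperator(
        task_id="submit_batch",
        project_id="PROJECT_ID",
        region="REGION",
        batch=batch_config,
        batch_id="job-name-batch",  # must be unique per submitted batch
    )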
Upvotes: 1
Views: 814
Reputation: 4457
To manually configure Hive for Spark, you need to use the spark.hive. prefix, not spark.hadoop., for Hive properties:
spark.hive.metastore.uris=thrift://...
spark.hive.metastore.warehouse.dir=gs://...
spark.sql.catalogImplementation=hive
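For illustration, a minimal sketch of how the runtime_config from the question would look with these properties; the Thrift URI, bucket path, and container image are the placeholders from the question:

runtime_config = {
    "version": "1.1",
    "properties": {
        # spark.hive.* (not spark.hadoop.*) properties are forwarded to the
        # Hive client that Spark SQL creates for the metastore connection
        "spark.hive.metastore.uris": "thrift://cluster-master-node:9083",
        "spark.hive.metastore.warehouse.dir": "gs://bucket/warehouse/dir",
        # make sure Spark SQL uses the Hive catalog implementation
        "spark.sql.catalogImplementation": "hive",
    },
    "container_image": "gcr.io/PROJECT_ID/path/image:latest",
}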
Upvotes: 0