Jon

Reputation: 11

Dataproc Serverless does not seem to make use of the Spark property to connect to an external Hive metastore

I have a GCP Postgres instance that serves as an external Hive metastore for a Dataproc cluster. I would like to utilize this metastore for Dataproc Serverless jobs. While experimenting with serverless and following the documentation, I am already able to:

I thought the Spark property "spark.hadoop.hive.metastore.uris" would allow serverless Spark jobs to connect to the Thrift server used by the Dataproc cluster, but it does not even seem to attempt the connection and instead errors with:

Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

The non-serverless Dataproc Spark jobs log:

INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-master-node:9083

as they successfully make the connection.

The jobs are initiated via Airflow with a custom class extending DataprocCreateBatchOperator, using a batch configuration like so:

{
    "spark_batch": {
        "jar_file_uris": [
            "gs://bucket/path/to/jarFileObject.jar"
        ],
        "main_class": "com.package.MainClass",
        "args": [
            "--args=for",
            "--spark=job"
        ]
    },
    "runtime_config": {
        "version": "1.1",
        "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083",
            "spark.sql.warehouse.dir": "gs://bucket/warehosue/dir",
            "spark.hadoop.metastore.catalog.default":"hive_metastore"
        },
        "container_image": "gcr.io/PROJECT_ID/path/image:latest"
    },
    "environment_config": {
        "execution_config": {
            "service_account": "service_account",
            "subnetwork_uri": "subnetwork_uri"
        },
        "peripherals_config": {
            "spark_history_server_config": {
                "dataproc_cluster": "projects/PROJECT_ID/regions/REGION/clusters/cluster-name"
            }
        }
    },
    "labels": {
        "job_name": "job_name"
    }
}
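
For reference, a simplified sketch of how this configuration is handed to the stock DataprocCreateBatchOperator (the real job uses a custom subclass; the DAG id, task id, project, region, and batch id below are placeholders, the batch dict is trimmed to the relevant keys, and Airflow 2.4+ is assumed):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Trimmed version of the batch configuration above; same structure, fewer keys.
BATCH_CONFIG = {
    "spark_batch": {
        "jar_file_uris": ["gs://bucket/path/to/jarFileObject.jar"],
        "main_class": "com.package.MainClass",
    },
    "runtime_config": {
        "version": "1.1",
        "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083",
        },
    },
}

with DAG(
    dag_id="serverless_metastore_job",   # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
) as dag:
    DataprocCreateBatchOperator(
        task_id="create_serverless_batch",  # placeholder task id
        project_id="PROJECT_ID",            # placeholder project
        region="REGION",                    # placeholder region
        batch=BATCH_CONFIG,                 # the dict shown above
        batch_id="job-name-batch",          # placeholder, must be unique per run
    )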

Upvotes: 1

Views: 814

Answers (1)

Igor Dvorzhak

Reputation: 4457

To manually configure Hive for Spark, you need to use the spark.hive. prefix, not the spark.hadoop. prefix, for Hive properties:

spark.hive.metastore.uris=thrift://...
spark.hive.metastore.warehouse.dir=gs://...
spark.sql.catalogImplementation=hive
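
Applied to the batch configuration in the question, that would mean a runtime_config along these lines (a sketch only, reusing the question's placeholder Thrift endpoint, bucket, and image):

runtime_config = {
    "version": "1.1",
    "properties": {
        # Hive properties use the spark.hive. prefix, not spark.hadoop.
        "spark.hive.metastore.uris": "thrift://cluster-master-node:9083",
        "spark.hive.metastore.warehouse.dir": "gs://bucket/warehouse/dir",
        "spark.sql.catalogImplementation": "hive",
    },
    "container_image": "gcr.io/PROJECT_ID/path/image:latest",
}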

Upvotes: 0
