Jon

Reputation: 11

Dataproc Serverless does not seem to make use of the Spark property to connect to an external Hive metastore

I have a GCP Postgres instance that serves as an external Hive metastore for a Dataproc cluster. I would like to utilize this metastore for Dataproc Serverless jobs. While experimenting with serverless and following the documentation, I am already able to:

I thought the Spark property "spark.hadoop.hive.metastore.uris" would allow serverless Spark jobs to connect to the Thrift server used by the Dataproc cluster, but it does not even seem to attempt the connection and instead errors with:

Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

The non-serverless Dataproc Spark jobs log:

INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-master-node:9083

as they successfully make the connection.

The jobs are initiated via Airflow with a custom class extending DataprocCreateBatchOperator, using a batch configuration like so:

{
    "spark_batch": {
        "jar_file_uris": [
            "gs://bucket/path/to/jarFileObject.jar"
        ],
        "main_class": "com.package.MainClass",
        "args": [
            "--args=for",
            "--spark=job"
        ]
    },
    "runtime_config": {
        "version": "1.1",
        "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083",
            "spark.sql.warehouse.dir": "gs://bucket/warehosue/dir",
            "spark.hadoop.metastore.catalog.default":"hive_metastore"
        },
        "container_image": "gcr.io/PROJECT_ID/path/image:latest"
    },
    "environment_config": {
        "execution_config": {
            "service_account": "service_account",
            "subnetwork_uri": "subnetwork_uri"
        },
        "peripherals_config": {
            "spark_history_server_config": {
                "dataproc_cluster": "projects/PROJECT_ID/regions/REGION/clusters/cluster-name"
            }
        }
    },
    "labels": {
        "job_name": "job_name"
    }
}
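
For reference, a simplified sketch of how this configuration is handed to the stock DataprocCreateBatchOperator (the real job uses a custom subclass; the DAG id, task id, project, region, and batch id below are placeholders, the batch dict is trimmed to the relevant keys, and Airflow 2.4+ is assumed):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Trimmed version of the batch configuration above; same structure, fewer keys.
BATCH_CONFIG = {
    "spark_batch": {
        "jar_file_uris": ["gs://bucket/path/to/jarFileObject.jar"],
        "main_class": "com.package.MainClass",
    },
    "runtime_config": {
        "version": "1.1",
        "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083",
        },
    },
}

with DAG(
    dag_id="serverless_metastore_job",   # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
) as dag:
    DataprocCreateBatchOperator(
        task_id="create_serverless_batch",  # placeholder task id
        project_id="PROJECT_ID",            # placeholder project
        region="REGION",                    # placeholder region
        batch=BATCH_CONFIG,                 # the dict shown above
        batch_id="job-name-batch",          # placeholder, must be unique per run
    )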

Upvotes: 1

Views: 814

Answers (1)

Igor Dvorzhak

Reputation: 4457

To manually configure Hive for Spark, you need to use the spark.hive. prefix, not the spark.hadoop. prefix, for Hive properties:

spark.hive.metastore.uris=thrift://...
spark.hive.metastore.warehouse.dir=gs://...
spark.sql.catalogImplementation=hive
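
Applied to the batch configuration in the question, that would mean a runtime_config along these lines (a sketch only, reusing the question's placeholder Thrift endpoint, bucket, and image):

runtime_config = {
    "version": "1.1",
    "properties": {
        # Hive properties use the spark.hive. prefix, not spark.hadoop.
        "spark.hive.metastore.uris": "thrift://cluster-master-node:9083",
        "spark.hive.metastore.warehouse.dir": "gs://bucket/warehouse/dir",
        "spark.sql.catalogImplementation": "hive",
    },
    "container_image": "gcr.io/PROJECT_ID/path/image:latest",
}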

Upvotes: 0
