john k

Reputation: 6614

AWS Glue Iceberg "Failed to connect to Hive Metastore" - but I'm not using Hive

I'm trying to create an AWS Glue job to test Apache Iceberg. I'm using the default tutorial here. I am getting the error "Failed to connect to Hive Metastore".

Other Stack Overflow posts about this error come from people who want to use Hive. I do not want Hive. I want to use the AWS Glue Data Catalog, and my script has zero references to Hive anywhere. Why is AWS Glue still looking for Hive?

Here's my code:

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext

DB_NAME='default'
CATALOG_NAME="glue_catalog"  #The AWS Glue Data Catalog is pre-configured for use by the Spark libraries as glue_catalog.  
TABLE_NAME = "table1"

conf = (SparkConf().setAppName("Spark Test") 
    .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .set(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")  #this does not enable iceberg support?
    .set(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", "s3://<your-warehouse-dir>")
    .set(f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") 
    .set(f"spark.sql.catalog.{CATALOG_NAME}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .set(f"spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")  
    )


#create spark runner
spark = ( SparkSession 
    .builder 
   .appName("Python Spark Iceberg example") 
   .config(conf=conf) 
   .getOrCreate() 
        )
        
glueContext = GlueContext(spark.sparkContext)  # not sure what this is for, but I tried it and saw no difference

#directly from https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/iceberg-spark.html
glueContext.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG_NAME}.{DB_NAME}.{TABLE_NAME}_nopartitions (
        c_customer_sk             int,
        c_customer_id             string,
        c_first_name              string,
        c_last_name               string,
        c_birth_country           string,
        c_email_address           string)
    USING iceberg
    OPTIONS ('format-version'='2')
""")

#### ERROR "Failed to connect to Hive Metastore"  #####
#######################################################

glueContext.sql(f"""
INSERT INTO {CATALOG_NAME}.{DB_NAME}.{TABLE_NAME}_nopartitions
SELECT c_customer_sk, c_customer_id, c_first_name, c_last_name, c_birth_country, c_email_address
FROM another_table
""")

Error logs for those who ask:

2025-01-30 20:53:44,898 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/main.py", line 85, in <module>
    spark.sql(f"""
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o104.sql.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
    at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
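A possible lead, offered as an assumption rather than a confirmed diagnosis: the trace points at `org.apache.iceberg.hive.HiveClientPool`, and as far as I understand Iceberg falls back to its Hive catalog when an Iceberg Spark catalog sets neither `type` nor `catalog-impl`. In the config above, `glue_catalog` sets `catalog-impl`, but `spark_catalog` does not. The sketch below is a plain-Python check for that pattern; `catalogs_without_explicit_backend` is a hypothetical helper written for this post, not part of Spark, Iceberg, or AWS Glue.

```python
# Hypothetical diagnostic helper (not a Spark, Iceberg, or Glue API).
# Assumption: an Iceberg Spark catalog with neither `type` nor `catalog-impl`
# falls back to the Hive catalog, which would explain the HiveClientPool error.

ICEBERG_SPARK_CATALOGS = {
    "org.apache.iceberg.spark.SparkCatalog",
    "org.apache.iceberg.spark.SparkSessionCatalog",
}
PREFIX = "spark.sql.catalog."


def catalogs_without_explicit_backend(settings):
    """Return names of Iceberg catalogs that set neither `type` nor `catalog-impl`."""
    flagged = []
    for key, value in settings.items():
        if not key.startswith(PREFIX) or value not in ICEBERG_SPARK_CATALOGS:
            continue
        name = key[len(PREFIX):]
        if "." in name:
            continue  # a property of a catalog, not a catalog definition
        has_backend = any(
            f"{key}.{option}" in settings for option in ("type", "catalog-impl")
        )
        if not has_backend:
            flagged.append(name)
    return sorted(flagged)


# The same settings as in the question's SparkConf, written as a plain dict.
settings = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.warehouse": "s3://<your-warehouse-dir>",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
}

print(catalogs_without_explicit_backend(settings))
```

With these settings the helper flags only `spark_catalog`. Inside the job, the same check could be run against the live configuration with `dict(conf.getAll())`. If the assumption holds, the unqualified `another_table` in the INSERT would also resolve through `spark_catalog`, so either removing that `spark_catalog` line or giving it the same `catalog-impl` as `glue_catalog` may be worth trying.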

What I tried:

I'm stumped. I'm not using Hive. What's going on?

Any insight is appreciated.

Upvotes: 0

Views: 113

Answers (0)
