I'm trying to create an AWS Glue job to test Apache Iceberg, following the default tutorial here. I'm getting the error "Failed to connect to Hive Metastore".
Other Stack Overflow posts with this error are from people who actually want Hive. I do not want Hive; I want to use the AWS Glue Data Catalog, and I have zero references to Hive anywhere in my script. Why is AWS Glue still looking for Hive?
Here's my code:
from pyspark.sql import SparkSession
from pyspark import SparkConf
from awsglue.context import GlueContext

DB_NAME = "default"
CATALOG_NAME = "glue_catalog"  # the AWS Glue Data Catalog is pre-configured for the Spark libraries as glue_catalog
TABLE_NAME = "table1"

conf = (
    SparkConf().setAppName("Spark Test")
    .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .set(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")  # doesn't this enable Iceberg support?
    .set(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", "s3://<your-warehouse-dir>")
    .set(f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .set(f"spark.sql.catalog.{CATALOG_NAME}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
)

# create the Spark session
spark = (
    SparkSession.builder
    .appName("Python Spark Iceberg example")
    .config(conf=conf)
    .getOrCreate()
)

glueContext = GlueContext(spark.sparkContext)  # not sure what this is for, but I tried it; no difference seen
#directly from https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/iceberg-spark.html
glueContext.sql(f"""
CREATE TABLE IF NOT EXISTS {CATALOG_NAME}.{DB_NAME}.{TABLE_NAME}_nopartitions (
c_customer_sk int,
c_customer_id string,
c_first_name string,
c_last_name string,
c_birth_country string,
c_email_address string)
USING iceberg
OPTIONS ('format-version'='2')
""")
#### ERROR "Failed to connect to Hive Metastore" #####
#######################################################
glueContext.sql(f"""
INSERT INTO {CATALOG_NAME}.{DB_NAME}.{TABLE_NAME}_nopartitions
SELECT c_customer_sk, c_customer_id, c_first_name, c_last_name, c_birth_country, c_email_address
FROM another_table
""")
Error log:
2025-01-30 20:53:44,898 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/main.py", line 85, in <module>
    spark.sql(f"""
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o104.sql.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
    at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
What I tried:
- --enable-glue-datacatalog = true: did nothing.
- enableHiveSupport(): did nothing.
- SparkSessionCatalog
- glueSession.sql and spark.sql everywhere

I'm stumped. I'm not using Hive. What's going on?
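Re-reading my conf, I also wrote a small pure-Python sketch (no Spark involved, just the same keys as a dict) to double-check which catalogs I actually give an explicit catalog-impl. As far as I can tell, only glue_catalog gets one; the spark_catalog override has none, so maybe it falls back to some default metastore, but I'm not sure that's how it works:

```python
# Sketch: the per-catalog Spark conf keys from my job, as a plain dict,
# to see which named catalogs have an explicit catalog-impl set.
CATALOG_NAME = "glue_catalog"

conf = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    f"spark.sql.catalog.{CATALOG_NAME}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    f"spark.sql.catalog.{CATALOG_NAME}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    # session catalog override -- note I never set a catalog-impl for it
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
}

# map catalog name -> its catalog-impl, for every key ending in ".catalog-impl"
impls = {k.split(".")[3]: v for k, v in conf.items() if k.endswith(".catalog-impl")}
print(impls)  # {'glue_catalog': 'org.apache.iceberg.aws.glue.GlueCatalog'}
```

So the only catalog explicitly pointed at Glue is glue_catalog, yet the error mentions Hive anyway.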
Any insight is appreciated.