I'm trying to use the Trino Spark connector in an AWS Glue 4.0 job, but I'm running into issues loading the connector.
Relevant code in my Glue job submission script:
trino_drivers_list = list_s3_files(s3_bucket, "bd_cap/trino-drivers/", s3_client)
trino_drivers = ",".join(trino_drivers_list)
spark_configurations = [
    "spark.sql.extensions=io.trino.spark.TrinoSQLExtension",
    "spark.sql.catalog.trino=io.trino.spark.TrinoCatalog",
    "spark.sql.defaultCatalog=mytrinocatalog",
    "spark.sql.catalog.mytrinocatalog.type=hive",
    "spark.logConf=true",
    "spark.driver.log.level=DEBUG",
    "spark.executor.log.level=DEBUG",
    "spark.driver.logClassPath=true",
    "spark.executor.logClassPath=true"]
spark_string = ' --conf '.join(spark_configurations)
job_args = {
    'Description': description,
    'Role': 'AWSGlueServiceRole',
    'ExecutionProperty': {
        'MaxConcurrentRuns': 3
    },
    'Command': {
        'Name': 'glueetl',
        'ScriptLocation': script_path,
        'PythonVersion': '3'
    },
    'GlueVersion': '4.0',
    'WorkerType': 'Standard',
    'NumberOfWorkers': 1,
    'DefaultArguments': {
        '--extra-py-files': ','.join(extra_py_files),
        '--additional-python-modules': python_modules,
        '--jars': trino_drivers,
        '--conf': spark_string
    }
}
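For reference, list_s3_files is just a thin wrapper around the S3 ListObjectsV2 API. A minimal sketch of what it does (the real helper may differ slightly, but this is the shape):

def list_s3_files(bucket, prefix, s3_client):
    # Sketch of my helper: list every .jar under the prefix and return
    # fully qualified s3:// paths for Glue's --jars argument.
    jar_paths = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".jar"):
                jar_paths.append(f"s3://{bucket}/{obj['Key']}")
    return jar_paths

Since DefaultArguments is a flat map and can only hold a single --conf key, I join all the configurations into one string, so spark_string ends up as a single value along the lines of spark.sql.extensions=io.trino.spark.TrinoSQLExtension --conf spark.sql.catalog.trino=io.trino.spark.TrinoCatalog --conf ... and so on.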
Here's the relevant code from the script I'd like Glue to run:
trino_drivers = [
    "s3://my_s3bucket/bd_cap/trino-drivers/guava-30.1-jre.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-annotations-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-core-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-databind-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/log-0.197.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/slf4j-api-1.7.30.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/slf4j-nop-1.7.30.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/trino-jdbc-469.jar"
]
spark = SparkSession.builder \
    .appName("Trino Writer") \
    .config("spark.sql.catalog.trino", "org.apache.spark.sql.trino.TrinoCatalog") \
    .config("spark.sql.catalog.trino.uri", trino_url) \
    .config("spark.sql.catalog.trino.user", trino_user) \
    .config("spark.sql.catalog.trino.password", trino_password) \
    .config("spark.jars", ",".join(trino_drivers)) \
    .getOrCreate()
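Right after getOrCreate() I can dump what Spark thinks it loaded, as a sanity check (a sketch; listJars is reached through the underlying Scala SparkContext via py4j):

# Sketch: compare the requested jars against what Spark actually registered.
print(spark.sparkContext.getConf().get("spark.jars", "<not set>"))
print(spark.sparkContext._jsc.sc().listJars())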
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o1904.cache.
: org.apache.spark.sql.connector.catalog.CatalogNotFoundException: Catalog 'mytrinocatalog' plugin class not found: spark.sql.catalog.mytrinocatalog is not defined
at org.apache.spark.sql.errors.QueryExecutionErrors$.catalogPluginClassNotFoundError(QueryExecutionErrors.scala:1608)
I'm at my wit's end with this error. Any help is much appreciated!
Here are the steps I've taken so far:
I've uploaded the required Trino jars to an S3 bucket, including trino-jdbc-469.jar, guava-30.1-jre.jar, jackson-annotations-2.12.3.jar, jackson-core-2.12.3.jar, jackson-databind-2.12.3.jar, log-0.197.jar, slf4j-api-1.7.30.jar, and slf4j-nop-1.7.30.jar.
I've checked for dependency conflicts using mvn dependency:tree. No conflicts were found.
I've configured the Glue job to use the spark.jars property to load the jars from S3.
I've also tried using the --extra-jars option and the spark.jars.packages property to load the Trino Spark connector.
I've verified that the jars are being loaded by logging them with spark.sparkContext._jvm.ClassLoader.getSystemResources("") (see the sketch below).
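Concretely, that check looks roughly like this (getSystemResources returns a java.util.Enumeration, so it has to be walked element by element):

# Sketch: walk the system class loader's resource roots and print each URL.
resources = spark.sparkContext._jvm.ClassLoader.getSystemResources("")
while resources.hasMoreElements():
    print(resources.nextElement().toString())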
I've tried setting the spark.driver.extraClassPath and spark.executor.extraClassPath properties to include the Trino jars:

"--conf spark.driver.extraClassPath": "/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.executor.extraClassPath": "/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers)

The spark.executor.extraJavaOptions and spark.driver.extraJavaOptions properties did not work either:

"--conf spark.executor.extraJavaOptions": "-Djava.class.path=/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.driver.extraJavaOptions": "-Djava.class.path=/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers)
Despite these efforts, I'm still getting a ClassNotFoundException when trying to use the Trino Spark connector. An LLM I consulted mentioned that spark.driver.class.path and spark.executor.class.path are read-only properties in Spark and cannot be set directly. As an attempted workaround, I've also tried setting CLASSPATH directly in my main script, but I get the same error:

import os
os.environ['CLASSPATH'] = '/opt/amazon/conf:/opt/amazon/glue-manifest.jar:' + ':'.join(trino_drivers)
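In case it helps with diagnosis: a one-liner like this, run inside the job, should show the classpath the driver JVM actually started with.

# Sketch: ask the running JVM for its effective classpath.
print(spark.sparkContext._jvm.System.getProperty("java.class.path"))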