user100855

Reputation: 11

Unable to load Trino Spark connector in AWS Glue job

I'm trying to use the Trino Spark connector in an AWS Glue 4.0 job, but I'm running into issues loading the connector.

Relevant code from my Glue job submission script:

trino_drivers_list = list_s3_files(s3_bucket, "bd_cap/trino-drivers/", s3_client)
trino_drivers = ",".join(trino_drivers_list)

spark_configurations = [
    "spark.sql.extensions=io.trino.spark.TrinoSQLExtension",
    "spark.sql.catalog.trino=io.trino.spark.TrinoCatalog",
    "spark.sql.defaultCatalog=mytrinocatalog",
    "spark.sql.catalog.mytrinocatalog.type=hive",
    "spark.logConf=true",
    "spark.driver.log.level=DEBUG",
    "spark.executor.log.level=DEBUG",
    "spark.driver.logClassPath=true",
    "spark.executor.logClassPath=true",
]

spark_string = ' --conf '.join(spark_configurations)
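
Since Glue only accepts one --conf key in DefaultArguments, the join above packs every setting into a single value. As I understand it, Glue prepends the first --conf itself, which is why the string starts with a bare key=value pair:

print(spark_string)
# spark.sql.extensions=io.trino.spark.TrinoSQLExtension
#   --conf spark.sql.catalog.trino=io.trino.spark.TrinoCatalog
#   --conf spark.sql.defaultCatalog=mytrinocatalog
#   --conf spark.sql.catalog.mytrinocatalog.type=hive
#   --conf spark.logConf=true ... (all on one line in practice)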

job_args = {
    'Description': description,
    'Role': 'AWSGlueServiceRole',
    'ExecutionProperty': {
        'MaxConcurrentRuns': 3
    },
    'Command': {
        'Name': 'glueetl',
        'ScriptLocation': script_path,
        'PythonVersion': '3'
    },
    'GlueVersion': '4.0',
    'WorkerType': 'Standard',
    'NumberOfWorkers': 1,
    'DefaultArguments': {
        '--extra-py-files': ','.join(extra_py_files),
        '--additional-python-modules': python_modules,
        '--jars': trino_drivers,
        '--conf': spark_string
    }
}
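
job_args is then fed straight into boto3's Glue client. A sketch of that call (create_job also requires a Name, which isn't in job_args above; the name below is just a placeholder):

import boto3

glue_client = boto3.client("glue")
# Name is required by create_job; "trino-writer-job" is illustrative only
glue_client.create_job(Name="trino-writer-job", **job_args)
glue_client.start_job_run(JobName="trino-writer-job")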

Here's the relevant code from the script I'd like Glue to run:

trino_drivers = [
    "s3://my_s3bucket/bd_cap/trino-drivers/guava-30.1-jre.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-annotations-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-core-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-databind-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/log-0.197.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/slf4j-api-1.7.30.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/slf4j-nop-1.7.30.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/trino-jdbc-469.jar"
]

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Trino Writer") \
    .config("spark.sql.catalog.trino", "org.apache.spark.sql.trino.TrinoCatalog") \
    .config("spark.sql.catalog.trino.uri", trino_url) \
    .config("spark.sql.catalog.trino.user", trino_user) \
    .config("spark.sql.catalog.trino.password", trino_password) \
    .config("spark.jars", ",".join(trino_drivers)) \
    .getOrCreate()
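
To rule out a submission-side problem, I've been probing the driver JVM from PySpark to see whether it can load the catalog class at all. This goes through the internal _jvm py4j gateway, so treat it as a debugging sketch rather than stable API:

# Ask the driver JVM to load the configured catalog class directly.
# A ClassNotFoundException here means the jar never reached the driver
# classpath, regardless of what spark.jars claims.
try:
    spark.sparkContext._jvm.java.lang.Class.forName("io.trino.spark.TrinoCatalog")
    print("catalog class is visible on the driver")
except Exception as e:
    print("catalog class is NOT visible:", e)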

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling o1904.cache.
: org.apache.spark.sql.connector.catalog.CatalogNotFoundException: Catalog 'mytrinocatalog' plugin class not found: spark.sql.catalog.mytrinocatalog is not defined
    at org.apache.spark.sql.errors.QueryExecutionErrors$.catalogPluginClassNotFoundError(QueryExecutionErrors.scala:1608)

I'm at my wit's end with this error. Any help is much appreciated!

Here are the steps I've taken so far, adding these entries to DefaultArguments:

"--conf spark.driver.extraClassPath": "/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.executor.extraClassPath": "/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.executor.extraJavaOptions": "-Djava.class.path=/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.driver.extraJavaOptions": "-Djava.class.path=/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers)

Despite these efforts, I'm still getting a ClassNotFoundException when trying to use the Trino Spark connector. An LLM I consulted said that spark.driver.class.path and spark.executor.class.path are read-only properties in Spark and cannot be set directly. As an attempted workaround, I've also tried setting CLASSPATH directly in my main script, but I get the same error:

import os
os.environ['CLASSPATH'] = '/opt/amazon/conf:/opt/amazon/glue-manifest.jar:' + ':'.join(trino_drivers)
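
Printing the effective classpath from inside the running job suggests why this can't work: the CLASSPATH variable is only read at JVM startup, and by the time my script runs the driver JVM is already up. A quick check, again via the internal py4j gateway:

# The driver JVM's live classpath; os.environ changes made after the JVM
# has started never show up here.
jvm_classpath = spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")
print(jvm_classpath)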

Upvotes: 1

Views: 31

Answers (0)
