I'm trying to use the Trino Spark connector in an AWS Glue 4.0 job, but I'm running into issues loading the connector.
Relevant code in my Glue job submission script:
trino_drivers_list = list_s3_files(s3_bucket, "bd_cap/trino-drivers/", s3_client)
trino_drivers = ",".join(trino_drivers_list)
spark_configurations = [
    "spark.sql.extensions=io.trino.spark.TrinoSQLExtension",
    "spark.sql.catalog.trino=io.trino.spark.TrinoCatalog",
    "spark.sql.defaultCatalog=mytrinocatalog",
    "spark.sql.catalog.mytrinocatalog.type=hive",
    "spark.logConf=true",
    "spark.driver.log.level=DEBUG",
    "spark.executor.log.level=DEBUG",
    "spark.driver.logClassPath=true",
    "spark.executor.logClassPath=true"]
spark_string = ' --conf '.join(spark_configurations)
job_args = {
    'Description': description,
    'Role': 'AWSGlueServiceRole',
    'ExecutionProperty': {
        'MaxConcurrentRuns': 3
    },
    'Command': {
        'Name': 'glueetl',
        'ScriptLocation': script_path,
        'PythonVersion': '3'
    },
    'GlueVersion': '4.0',
    'WorkerType': 'Standard',
    'NumberOfWorkers': 1,
    'DefaultArguments': {
        '--extra-py-files': ','.join(extra_py_files),
        '--additional-python-modules': python_modules,
        '--jars': trino_drivers,
        '--conf': spark_string
    }
}
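For reference, list_s3_files is just a thin wrapper around the S3 ListObjectsV2 API. A minimal sketch of what it does (the real helper may differ slightly, but this is the shape):

def list_s3_files(bucket, prefix, s3_client):
    # Sketch of my helper: list every .jar under the prefix and return
    # fully qualified s3:// paths for Glue's --jars argument.
    jar_paths = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".jar"):
                jar_paths.append(f"s3://{bucket}/{obj['Key']}")
    return jar_paths

Since DefaultArguments is a flat map and can only hold a single --conf key, I join all the configurations into one string, so spark_string ends up as a single value along the lines of spark.sql.extensions=io.trino.spark.TrinoSQLExtension --conf spark.sql.catalog.trino=io.trino.spark.TrinoCatalog --conf ... and so on.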
Here's the relevant code from the script I'd like Glue to run:
trino_drivers = [
    "s3://my_s3bucket/bd_cap/trino-drivers/guava-30.1-jre.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-annotations-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-core-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/jackson-databind-2.12.3.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/log-0.197.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/slf4j-api-1.7.30.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/slf4j-nop-1.7.30.jar",
    "s3://my_s3bucket/bd_cap/trino-drivers/trino-jdbc-469.jar"
]
spark = SparkSession.builder \
    .appName("Trino Writer") \
    .config("spark.sql.catalog.trino", "org.apache.spark.sql.trino.TrinoCatalog") \
    .config("spark.sql.catalog.trino.uri", trino_url) \
    .config("spark.sql.catalog.trino.user", trino_user) \
    .config("spark.sql.catalog.trino.password", trino_password) \
    .config("spark.jars", ",".join(trino_drivers)) \
    .getOrCreate()
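Right after getOrCreate() I can dump what Spark thinks it loaded, as a sanity check (a sketch; listJars is reached through the underlying Scala SparkContext via py4j):

# Sketch: compare the requested jars against what Spark actually registered.
print(spark.sparkContext.getConf().get("spark.jars", "<not set>"))
print(spark.sparkContext._jsc.sc().listJars())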
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o1904.cache.
: org.apache.spark.sql.connector.catalog.CatalogNotFoundException: Catalog 'mytrinocatalog' plugin class not found: spark.sql.catalog.mytrinocatalog is not defined
at org.apache.spark.sql.errors.QueryExecutionErrors$.catalogPluginClassNotFoundError(QueryExecutionErrors.scala:1608)
I'm at my wit's end with this error. Any help is much appreciated!
Here are the steps I've taken so far:
I've uploaded the required Trino jars to an S3 bucket, including trino-jdbc-469.jar, guava-30.1-jre.jar, jackson-annotations-2.12.3.jar, jackson-core-2.12.3.jar, jackson-databind-2.12.3.jar, log-0.197.jar, slf4j-api-1.7.30.jar, and slf4j-nop-1.7.30.jar.
I've checked for dependency conflicts using mvn dependency:tree. No conflicts were found.
I've configured the Glue job to use the spark.jars property to load the jars from S3.
I've also tried using the --extra-jars option and the spark.jars.packages property to load the Trino Spark connector.
I've verified that the jars are being loaded by logging them with spark.sparkContext._jvm.ClassLoader.getSystemResources("") (see the sketch below).
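Concretely, that check looks roughly like this (getSystemResources returns a java.util.Enumeration, so it has to be walked element by element):

# Sketch: walk the system class loader's resource roots and print each URL.
resources = spark.sparkContext._jvm.ClassLoader.getSystemResources("")
while resources.hasMoreElements():
    print(resources.nextElement().toString())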
I've tried setting the spark.driver.extraClassPath and spark.executor.extraClassPath properties to include the Trino jars:

"--conf spark.driver.extraClassPath": "/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.executor.extraClassPath": "/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers)

The spark.executor.extraJavaOptions and spark.driver.extraJavaOptions properties did not work either:

"--conf spark.executor.extraJavaOptions": "-Djava.class.path=/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers),
"--conf spark.driver.extraJavaOptions": "-Djava.class.path=/opt/amazon/conf:/opt/amazon/glue-manifest.jar:" + ":".join(trino_drivers)
Despite these efforts, I'm still getting a ClassNotFoundException when trying to use the Trino Spark connector. An LLM I consulted mentioned that spark.driver.class.path and spark.executor.class.path are read-only properties in Spark and cannot be set directly. As an attempted workaround, I've also tried setting CLASSPATH directly in my main script, but I get the same error:

import os
os.environ['CLASSPATH'] = '/opt/amazon/conf:/opt/amazon/glue-manifest.jar:' + ':'.join(trino_drivers)
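In case it helps with diagnosis: a one-liner like this, run inside the job, should show the classpath the driver JVM actually started with.

# Sketch: ask the running JVM for its effective classpath.
print(spark.sparkContext._jvm.System.getProperty("java.class.path"))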