Reputation: 51
I have a Glue job that loads data from RDS to Snowflake.
The job used to write to S3 before this Snowflake instance existed. Now, running it with Snowflake as the sink fails with this error: "IllegalArgumentException: No group with name <host>"
From the driver logs:
23/03/29 09:45:32 ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] Last Executed Line number from script job-rds-to-snowflake-visual.py: 50
23/03/29 09:45:32 ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] {"Event":"GlueETLJobExceptionEvent","Timestamp":1680083132028,"Failure Reason":"Traceback (most recent call last):\n File \"/tmp/job-rds-to-snowflake-visual.py\", line 50, in <module>\n transformation_ctx=\"SnowflakeDataCatalog_node1680082896733\",\n File \"/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py\", line 819, in from_catalog\n return self._glue_context.write_dynamic_frame_from_catalog(frame, db, table_name, redshift_tmp_dir, transformation_ctx, additional_options, catalog_id)\n File \"/opt/amazon/lib/python3.6/site-packages/awsglue/context.py\", line 386, in write_dynamic_frame_from_catalog\n makeOptions(self._sc, additional_options), catalog_id)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\", line 1305, in __call__\n answer, self.gateway_client, self.target_id, self.name)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 117, in deco\n raise converted from None\npyspark.sql.utils.IllegalArgumentException: No group with name <host>","Stack Trace":[{"Declaring Class":"deco","Method Name":"raise converted from None","File Name":"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py","Line Number":117},{"Declaring Class":"__call__","Method Name":"answer, self.gateway_client, self.target_id, self.name)","File Name":"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py","Line Number":1305},{"Declaring Class":"write_dynamic_frame_from_catalog","Method Name":"makeOptions(self._sc, additional_options), catalog_id)","File Name":"/opt/amazon/lib/python3.6/site-packages/awsglue/context.py","Line Number":386},{"Declaring Class":"from_catalog","Method Name":"return self._glue_context.write_dynamic_frame_from_catalog(frame, db, table_name, redshift_tmp_dir, transformation_ctx, additional_options, catalog_id)","File Name":"/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py","Line Number":819},{"Declaring Class":"<module>","Method Name":"transformation_ctx=\"SnowflakeDataCatalog_node1680082896733\",","File Name":"/tmp/job-rds-to-snowflake-visual.py","Line Number":50}],"Last Executed Line number":50,"script":"job-rds-to-snowflake-visual.py"}
23/03/29 09:45:32 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
File "/tmp/job-rds-to-snowflake-visual.py", line 50, in <module>
transformation_ctx="SnowflakeDataCatalog_node1680082896733",
File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 819, in from_catalog
return self._glue_context.write_dynamic_frame_from_catalog(frame, db, table_name, redshift_tmp_dir, transformation_ctx, additional_options, catalog_id)
File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 386, in write_dynamic_frame_from_catalog
makeOptions(self._sc, additional_options), catalog_id)
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: No group with name <host>
23/03/29 09:45:31 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: sf_audit_db, tableName: auditlog_dev_public_rds_auditlog, isRegisteredWithLF: false
23/03/29 09:45:26 WARN SharedState: URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory
23/03/29 09:45:24 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.JDBCDataSource
23/03/29 09:45:24 INFO GlueContext: Glue secret manager integration: secretId is not provided.
23/03/29 09:45:24 INFO GlueContext: nameSpace: pg_audit_db, tableName: supportdatabase_public_audit_log_condensed, connectionName conn-rds-pg-auditdb, vendor: postgresql
23/03/29 09:45:24 INFO GlueContext: getCatalogSource: transactionId: <not-specified> asOfTime: <not-specified> catalogPartitionIndexPredicate: <not-specified>
23/03/29 09:45:24 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: pg_audit_db, tableName: supportdatabase_public_audit_log_condensed, isRegisteredWithLF: false, isGoverned: false, isRowFilterEnabled: false, useAdvancedFiltering: false, isTableFromSchemaRegistry: false
23/03/29 09:45:22 INFO GlueContext: GlueMetrics configured and enabled
23/03/29 09:45:19 INFO Utils: Successfully started service 'sparkDriver' on port 42465.
I didn't touch the generated script because we want to keep the job in visual mode. Here is the script in case it helps:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node RDS (Data Catalog)
RDSDataCatalog_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="pg_audit_db",
    table_name="supportdatabase_public_audit_log_condensed",
    transformation_ctx="RDSDataCatalog_node1",
)

# Script generated for node SQL Query
SqlQuery0 = """
SELECT
    *
FROM
    webapirequestlog
"""
SQLQuery_node1679649943271 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"webapirequestlog": RDSDataCatalog_node1},
    transformation_ctx="SQLQuery_node1679649943271",
)

# Script generated for node Snowflake (Data Catalog)
SnowflakeDataCatalog_node1680082896733 = glueContext.write_dynamic_frame.from_catalog(
    frame=SQLQuery_node1679649943271,
    database="sf_audit_db",
    table_name="auditlog_dev_public_rds_auditlog",
    transformation_ctx="SnowflakeDataCatalog_node1680082896733",
)
job.commit()
I have tried googling the error, but none of the results were helpful. Any ideas on what to check?
Upvotes: 2
Views: 467
Reputation: 41
The issue is that a JDBC connection defined for Snowflake can be used as a data source for the crawler, but it cannot be used in your ETL job. In the ETL job you must use the snowflake connection type, which unfortunately cannot be used as a data source for the crawler, at least as of now.
Here is the relevant documentation: https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html
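For reference, here is a minimal sketch of what the write step could look like when the job writes through the snowflake connection type with from_options instead of a Data Catalog table backed by a JDBC connection. The connection name, database, schema, and table names below are placeholders, and the exact supported options depend on your Glue version, so treat this as an illustration rather than a drop-in fix:

# Sketch only: write the transformed frame using the native "snowflake"
# connection type. "conn-snowflake-auditdb", "AUDITLOG_DEV", "PUBLIC" and
# "RDS_AUDITLOG" are placeholders for your own connection and target table.
glueContext.write_dynamic_frame.from_options(
    frame=SQLQuery_node1679649943271,
    connection_type="snowflake",
    connection_options={
        "connectionName": "conn-snowflake-auditdb",
        "sfDatabase": "AUDITLOG_DEV",
        "sfSchema": "PUBLIC",
        "dbtable": "RDS_AUDITLOG",
    },
    transformation_ctx="SnowflakeWrite_node1",
)

If you rebuild the target node in Glue Studio with a Snowflake connection (rather than the catalog table crawled over JDBC), the visual editor should generate an equivalent write step for you.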
Upvotes: 0