Reputation: 7623
I have a Glue Spark job written in Scala that needs to read a data source from an RDS database (PostgreSQL). I created the connection in the AWS console and tested it; it works, so I can confirm the Glue connection to RDS is set up correctly (role, security group).
When I add this source to my Glue Spark job, I get this error in the console logs:
INFO 2024-04-15T07:26:25,251 245857 com.amazonaws.services.glue.connectors.NativeConnectorService$ [main] Glue connectors: Copy connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar
INFO 2024-04-15T07:26:25,251 245857 com.amazonaws.services.glue.connectors.NativeConnectorService$ [main] Glue connectors: Copy is finished
Glue ETL Marketplace - Start ETL connector activation process...
Glue ETL Marketplace - downloading jars for following connections: List(my_glue_connection) using command: List(python3, -u, -m, docker.unpack_docker_image, --connections, my_glue_connection, --result_path, jar_paths, --region, eu-west-1, --endpoint, https://glue.eu-west-1.amazonaws.com, --proxy, xx.xx.xx.xx:8888)
2024-04-15 07:26:31,431 - __main__ - INFO - Glue ETL Marketplace - Start downloading connector jars for connection: my_glue_connection
2024-04-15 07:26:32,492 - __main__ - INFO - Glue ETL Marketplace - using region: eu-west-1, proxy: xx.xx.xx.xx:8888 and glue endpoint: https://glue.eu-west-1.amazonaws.com to get connection: my_glue_connection
2024-04-15 07:26:32,651 - __main__ - WARNING - Glue ETL Marketplace - Connection my_glue_connection is not a CUSTOM or Marketplace connection, skip jar downloading for it
2024-04-15 07:26:32,651 - __main__ - INFO - Glue ETL Marketplace - successfully wrote jar paths to "jar_paths"
Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
...
SdkClientException occurred : com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to aws-glue-assets-xxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com:443 [aws-glue-assets-XXXXX-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx0, aws-glue-assets-xxxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx] failed: connect timed out
3 Retry(s) left
My Spark job tries to connect as follows:
val jdbcUrl = s"jdbc:postgresql://$jdbcHostname:$jdbcPort/$jdbcDatabase"
val connectionProperties = new java.util.Properties()
connectionProperties.put("driver", "org.postgresql.Driver") // Spark's documented JDBC option key is lowercase "driver"
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
val dataFrame = spark.read.jdbc(jdbcUrl, "table-name", connectionProperties)
dataFrame.show()
A strange message in the logs is the copy of the connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar, but I have never configured anything Redshift-related on the connection or the Glue Spark job. My Glue connection (written in Terraform) is a JDBC connection:
resource "aws_glue_connection" "my_glue_connection" {
  name            = "my_glue_connection"
  connection_type = "JDBC"

  connection_properties = {
    JDBC_CONNECTION_URL = "jdbc:postgresql://${var.rds_jdbc_hostname}:${var.rds_jdbc_port}/${var.rds_jdbc_db}"
    PASSWORD            = var.rds_jdbc_password
    USERNAME            = var.rds_jdbc_username
  }

  physical_connection_requirements {
    subnet_id              = "subnet-xxxx"
    availability_zone      = "xx-xx-xx"
    security_group_id_list = [aws_security_group.my_glue_connection_sg.id]
  }
}
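For reference, the log line "no marketplace/custom connection attached to the job" only goes away if the connection is actually attached to the job. A minimal sketch of that attachment in the same Terraform style (the job name, role resource, and script location below are assumptions, not taken from my setup):

```hcl
# Hypothetical job definition; the "connections" argument is the relevant part.
resource "aws_glue_job" "my_glue_job" {
  name     = "my-glue-job"                  # assumed name
  role_arn = aws_iam_role.my_glue_role.arn  # assumed IAM role resource

  # Attaching the connection runs the job inside the connection's
  # VPC/subnet so the JDBC endpoint is reachable.
  connections = [aws_glue_connection.my_glue_connection.name]

  command {
    script_location = "s3://my-glue-scripts-bucket/my_job.scala" # assumed path
  }
}
```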
The closest question I could find is Error downloading Glue ETL Marketplace connector in AWS Glue: "LAUNCH ERROR", but it has no answer yet.
I checked this page https://repost.aws/knowledge-center/glue-marketplace-connector-errors and attached the AmazonEC2ContainerRegistryReadOnly policy, but it had no effect.
Upvotes: 0
Views: 146
Reputation: 7623
I solved this issue; sharing it here for completeness. I was missing two configurations. Here is the Terraform for both.
1 - The Glue connection's VPC needs a VPC endpoint for the Glue service, an interface endpoint in this case. https://repost.aws/knowledge-center/glue-connect-time-out-error
resource "aws_vpc_endpoint" "my_glue_connection_endpoint" {
  vpc_id              = "vpc-XXXXX"
  service_name        = "com.amazonaws.${var.aws_region}.glue"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = ["subnet-XXXXX"]
  security_group_ids  = [aws_security_group.my_glue_connection_sg.id]
}
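Note that the timeout in the question's logs is against an S3 endpoint (aws-glue-assets-...), and the same knowledge-center article also covers S3 access from the VPC. If S3 timeouts persist, a gateway endpoint for S3 may be needed as well; a sketch in the same style (the VPC and route table IDs are placeholders, and this was not strictly part of the two changes above):

```hcl
resource "aws_vpc_endpoint" "s3_gateway_endpoint" {
  vpc_id            = "vpc-XXXXX"                          # placeholder
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = ["rtb-XXXXX"]                        # placeholder
}
```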
2 - The security group also needed to allow all egress traffic. I was allowing egress only to the self-referenced SG, but allowing all outbound traffic was necessary as well.
resource "aws_vpc_security_group_egress_rule" "my_glue_connection_sg_egress_all" {
  description       = "security group egress rule to allow all traffic from Glue connection to RDS"
  security_group_id = aws_security_group.my_glue_connection_sg.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1" # all traffic
}
Upvotes: 0