Felipe

Reputation: 7623

Glue Spark (Scala) job is not connecting to PostgreSQL RDS

I have a Glue Spark job written in Scala that needs to read a data source from an RDS (PostgreSQL) database. I created the connection in the AWS console and tested it. It works, so I can confirm that the Glue connection to RDS is set up correctly (role, security group).

When I added this source to my Glue Spark job, I got this error in the console:

    INFO 2024-04-15T07:26:25,251 245857  com.amazonaws.services.glue.connectors.NativeConnectorService$  [main]  Glue connectors: Copy connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar
    INFO 2024-04-15T07:26:25,251 245857  com.amazonaws.services.glue.connectors.NativeConnectorService$  [main]  Glue connectors: Copy is finished
    Glue ETL Marketplace - Start ETL connector activation process...
    Glue ETL Marketplace - downloading jars for following connections: List(my_glue_connection) using command: List(python3, -u, -m, docker.unpack_docker_image, --connections, my_glue_connection, --result_path, jar_paths, --region, eu-west-1, --endpoint, https://glue.eu-west-1.amazonaws.com, --proxy, xx.xx.xx.xx:8888)
    2024-04-15 07:26:31,431 - __main__ - INFO - Glue ETL Marketplace - Start downloading connector jars for connection: my_glue_connection
    2024-04-15 07:26:32,492 - __main__ - INFO - Glue ETL Marketplace - using region: eu-west-1, proxy: xx.xx.xx.xx:8888 and glue endpoint: https://glue.eu-west-1.amazonaws.com to get connection: my_glue_connection
    2024-04-15 07:26:32,651 - __main__ - WARNING - Glue ETL Marketplace - Connection my_glue_connection is not a CUSTOM or Marketplace connection, skip jar downloading for it
    2024-04-15 07:26:32,651 - __main__ - INFO - Glue ETL Marketplace - successfully wrote jar paths to "jar_paths"
    Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
    Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
    ...
    SdkClientException occurred : com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to aws-glue-assets-xxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com:443 [aws-glue-assets-XXXXX-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx0, aws-glue-assets-xxxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx] failed: connect timed out
    3 Retry(s) left

My Spark job tries to connect as:

    val jdbcUrl = s"jdbc:postgresql://$jdbcHostname:$jdbcPort/$jdbcDatabase"
    val connectionProperties = new java.util.Properties()
    connectionProperties.put("driver", "org.postgresql.Driver") // Spark expects the lowercase "driver" key
    connectionProperties.put("user", jdbcUsername)
    connectionProperties.put("password", jdbcPassword)

    val dataFrame = spark.read.jdbc(jdbcUrl, "table-name", connectionProperties)
    dataFrame.show()
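
When a read like this hangs with connection timeouts, it helps to rule out networking before blaming Spark or JDBC. Below is a minimal preflight sketch (`isReachable` is a hypothetical helper, not part of the Glue API) that checks whether the driver can open a plain TCP connection to the database endpoint:

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
// A false here points at security groups or routing, not at the JDBC driver.
def isReachable(host: String, port: Int, timeoutMs: Int = 3000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: Exception => false
  } finally {
    socket.close()
  }
}
```

Calling something like `isReachable(jdbcHostname, jdbcPort.toInt)` at the start of the job surfaces a security-group or routing problem immediately, instead of after a long JDBC timeout.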

A strange message in the logs is the copying of /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar. But I've never set up anything related to Redshift in the connection or the Glue Spark job. My Glue connection (written in Terraform) is a JDBC connection:

    resource "aws_glue_connection" "my_glue_connection" {
      name            = "my_glue_connection"
      connection_type = "JDBC"
      connection_properties = {
        JDBC_CONNECTION_URL = "jdbc:postgresql://${var.rds_jdbc_hostname}:${var.rds_jdbc_port}/${var.rds_jdbc_db}"
        PASSWORD            = var.rds_jdbc_password
        USERNAME            = var.rds_jdbc_username
      }

      physical_connection_requirements {
        subnet_id              = "subnet-xxxx"
        availability_zone      = "xx-xx-xx"
        security_group_id_list = [aws_security_group.my_glue_connection_sg.id]
      }
    }

The closest question that I could find is "Error downloading Glue ETL Marketplace connector in AWS Glue: 'LAUNCH ERROR'", but it has no answer yet.

I checked this page https://repost.aws/knowledge-center/glue-marketplace-connector-errors and attached the AmazonEC2ContainerRegistryReadOnly policy, but it had no effect.

Upvotes: 0

Views: 146

Answers (1)

Felipe

Reputation: 7623

I solved this issue; sharing here for completeness. I was missing two configurations. Here is the Terraform for both.

1 - The Glue connection needs a VPC endpoint, in this case an interface endpoint for the Glue service: https://repost.aws/knowledge-center/glue-connect-time-out-error

    resource "aws_vpc_endpoint" "my_glue_connection_endpoint" {
      vpc_id              = "vpc-XXXXX"
      service_name        = "com.amazonaws.${var.aws_region}.glue"
      vpc_endpoint_type   = "Interface"
      private_dns_enabled = true
      subnet_ids          = ["subnet-XXXXX"]
      security_group_ids  = [aws_security_group.my_glue_connection_sg.id]
    }

2 - The security group also needed to allow all egress traffic. I was only allowing traffic from the self-referenced SG, but it was necessary to allow all outbound traffic as well.

    resource "aws_vpc_security_group_egress_rule" "my_glue_connection_sg_egress_all" {
      description       = "security group egress rule to allow all traffic from Glue connection to RDS"
      security_group_id = aws_security_group.my_glue_connection_sg.id
      cidr_ipv4         = "0.0.0.0/0"
      ip_protocol       = "-1" # all traffic
    }
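
With `private_dns_enabled = true`, the regional service hostname resolves to a private IP from inside the VPC, so the job no longer needs an internet path to reach it. As a sanity check, the address a hostname actually resolves to can be inspected from the driver using only the JVM standard library (a diagnostic sketch, not part of the Glue API):

```scala
import java.net.InetAddress

// Resolve a hostname to the IP the AWS SDK would connect to. Inside the
// VPC, with private DNS enabled, the service endpoint should resolve to
// a private address (e.g. 10.x.x.x) rather than a public one.
def resolve(host: String): String =
  InetAddress.getByName(host).getHostAddress
```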

Upvotes: 0
