Turiphro

Reputation: 427

How to connect AWS Glue to a VPC, and access private resources?

I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers).

Unfortunately, AWS Glue doesn't seem to support running inside user-defined VPCs. AWS does provide Glue Database Connections which, when used with the Glue SDK, set up elastic network interfaces inside the specified VPC for the Glue/Spark worker nodes. Those network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of a specific database, and it is not clear whether other traffic (e.g., a REST call to a service) is also tunnelled through the VPC.

Is there a reliable way to set up a Glue -> VPC connection that will tunnel all traffic through a VPC?

Upvotes: 14

Views: 24535

Answers (2)

Oleksandr Lykhonosov

Reputation: 1378

You can create a database connection with NETWORK connection type and use that connection in your Glue job. It will allow your job to call a REST API or any other resource within your VPC.
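If you prefer to script this instead of clicking through the console, here is a minimal sketch using boto3. The connection name, subnet ID, security group ID and availability zone are placeholders I made up, so substitute your own. The helper names (`network_connection_input`, `create_network_connection`) are hypothetical too; only `create_connection` is the actual Glue client call.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput payload for a Glue connection of type NETWORK.

    All argument values are supplied by the caller; nothing here is specific
    to any particular account or VPC.
    """
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection carries no JDBC URL or credentials.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(connection_input):
    """Send the payload to Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the payload builder works without AWS access

    glue = boto3.client("glue")
    glue.create_connection(ConnectionInput=connection_input)


# Placeholder IDs, for illustration only.
payload = network_connection_input(
    "my-vpc-network", "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"], "eu-west-1a"
)
```

The connection then has to be attached to the job, for example by listing its name under `Connections` when the job is created or edited.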

https://docs.aws.amazon.com/glue/latest/dg/connection-using.html

Network (designates a connection to a data source within an Amazon Virtual Private Cloud environment (Amazon VPC))

https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html

To allow AWS Glue to communicate with its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC and not open it to all networks.
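As a rough sketch of what that self-referencing rule looks like when added with boto3 (the security group ID is a placeholder, and the helper names are mine, not part of any AWS API):

```python
def self_referencing_tcp_rule(security_group_id):
    """Build arguments for an inbound rule that allows all TCP ports,
    but only from members of the same security group."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 0,
                "ToPort": 65535,
                # The source is the group itself, not a CIDR range.
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }


def apply_rule(rule):
    """Apply the rule. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rule builder works without AWS access

    boto3.client("ec2").authorize_security_group_ingress(**rule)


# Placeholder ID, for illustration only.
rule = self_referencing_tcp_rule("sg-0123456789abcdef0")
```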

Upvotes: 10

Mark B

Reputation: 200411

However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.

I agree the documentation is confusing, but according to this paragraph on the page you linked, it appears that all traffic is indeed tunneled through the VPC, since you have to have a NAT Gateway or VPC endpoints to allow Glue to access things outside the VPC once you have configured it with VPC access:

All JDBC data stores that are accessed by the job must be available from the VPC subnet. To access Amazon S3 from within your VPC, a VPC endpoint is required. If your job needs to access both VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT) gateway inside the VPC.
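For the S3 part of that paragraph, a gateway VPC endpoint can be created along these lines. This is only a sketch under assumptions: the VPC ID, region and route table ID are placeholders, and the helper names are hypothetical. A NAT gateway for general internet access is a separate setup not shown here.

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build arguments for a gateway-type VPC endpoint to Amazon S3."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
        # Gateway endpoints are attached to the route tables of the subnets
        # that should reach S3 through them.
        "RouteTableIds": list(route_table_ids),
    }


def create_endpoint(request):
    """Create the endpoint. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without AWS access

    boto3.client("ec2").create_vpc_endpoint(**request)


# Placeholder IDs, for illustration only.
request = s3_gateway_endpoint_request(
    "vpc-0123456789abcdef0", "eu-west-1", ["rtb-0123456789abcdef0"]
)
```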

Upvotes: 1
