Brent Arias

Reputation: 30205

Lambda in VPC has Connection Error against AWS API

I have a Python lambda that needs to access the AWS API. If it is not associated with a VPC subnet, it works. But when it is associated with a VPC subnet, it gets the exception botocore.exceptions.EndpointConnectionError with the message Could not connect to the endpoint URL: "https://ec2.us-east-1.amazonaws.com/". I've seen this kind of problem described here and here, usually caused by a missing NAT gateway route. However, I have all the correct "pieces" and it still doesn't work.

All of those pieces appear to be in place. Here is what I observe:

When I inspect the VPC Flow Logs, I see that the Lambda ENIs are successfully originating DNS requests (port 53), like this one:

2 405857719141 eni-03bb24a034d226e5c 10.136.95.104 10.136.7.233 38109 53 17 1 73 1571250675 1571250733 ACCEPT OK

There are no other VPC flow log records besides this...nothing indicating REJECT. My actual Python code, which works when the lambda is not associated with a VPC, looks something like this:

import os
import logging

import boto3
from botocore.client import Config
from botocore.session import Session

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    logger.info('Create Session')
    s = Session()
    logger.info('Session Created')
    logger.info('fetching client')
    ec2_res = boto3.resource('ec2')
    logger.info('got vpc resource')

    # I've tried different approaches to creating a client
    config = Config(connect_timeout=45, read_timeout=45, retries={'max_attempts': 0})
    ec2_client = s.create_client('ec2', config=config)
    #ec2_client = boto3.client('ec2', config=config)
    #ec2_client = boto3.client('ec2', endpoint_url="https://aws.amazon.com/ec2", config=config)
    #ec2_client = boto3.client('ec2', endpoint_url=endpoint)
    logger.info('fetched_client')

    route_table_id = os.environ['fromTGWRouteTableId']
    logger.info('got route table id from environment')

    try:
        logger.info(f'route table(s):{route_table_id}')
        # this request will raise an exception after 40 seconds.
        route_table = ec2_client.describe_route_tables(RouteTableIds=[route_table_id])
        logger.info('got client response for route_tables')
        rt = route_table['RouteTables'][0]
        # describe_route_tables returns plain dicts, so index by key
        logger.info(f"The RT ID is: {rt['RouteTableId']}")
    except Exception as e:
        logger.info(f'{type(e)}')
        logger.info(f'{e}')

    return

I had to tune the Lambda and boto3 client timeouts just right to actually capture the error; with any other settings, the function timed out before the exception could be logged. Here are the CloudWatch log entries for the lambda:

START RequestId: a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Version: $LATEST
[INFO] 2019-10-16T20:22:38.914Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Create Session
[INFO] 2019-10-16T20:22:39.36Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Session Created
[INFO] 2019-10-16T20:22:39.92Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 fetching client
[INFO] 2019-10-16T20:22:39.94Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Found credentials in environment variables.
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 got vpc resource
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 fetched_client
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 got route table id from environment
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 route table(s):rtb-0d92f4db98072d6fc
[INFO] 2019-10-16T20:22:49.493Z bd2bf6b7-2fa6-46ea-8115-cb830cb07f32 <class 'botocore.exceptions.EndpointConnectionError'>
[INFO] 2019-10-16T20:22:49.493Z bd2bf6b7-2fa6-46ea-8115-cb830cb07f32 Could not connect to the endpoint URL: "https://ec2.us-east-1.amazonaws.com/"
END RequestId: bd2bf6b7-2fa6-46ea-8115-cb830cb07f32
REPORT RequestId: bd2bf6b7-2fa6-46ea-8115-cb830cb07f32 Duration: 40960.97 ms Billed Duration: 41000 ms Memory Size: 128 MB Max Memory Used: 83 MB
2 unknown eni-07003b087845964ff - - - - - - - 1571257388 1571257400 - NODATA

Any ideas of what I'm overlooking?

Update

In my Python code, I've added the following test:

import urllib.request

contents = urllib.request.urlopen("https://google.com").readline()
logger.info(f'http response: {contents}')

The above raises a URLError with the message urlopen error [Errno -3] Temporary failure in name resolution.
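To isolate name resolution from HTTP connectivity entirely, a bare lookup through the standard library can be dropped into the handler. This is a minimal sketch; the hostname and helper name are just illustrations:

import logging
import socket

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def check_dns(hostname='ec2.us-east-1.amazonaws.com'):
    """Attempt a bare DNS lookup, bypassing HTTP entirely."""
    try:
        addrs = socket.getaddrinfo(hostname, 443)
        logger.info(f'{hostname} resolved to {addrs[0][4][0]}')
    except socket.gaierror as e:
        # Errno -3 (EAI_AGAIN) is the same "Temporary failure in
        # name resolution" that urlopen surfaced above.
        logger.info(f'DNS lookup failed for {hostname}: {e}')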

I then created an Ubuntu EC2 instance in a public subnet of my VPC. A ping test to google.com failed with "unknown host", but pinging an explicit public IP address succeeded.

Likewise host and dig failed, as shown:

ubuntu@ip-10-136-80-220:/etc$ host google.com
;; connection timed out; no servers could be reached
ubuntu@ip-10-136-80-220:/etc$ dig google.com

; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached

I can make dig succeed if I explicitly point it at a public DNS server. This worked: dig @8.8.8.8 google.com.

Below are the contents of my resolv.conf, with the real company name masked with "mycompany.com":

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.136.7.233
nameserver 10.136.7.249
search preprod.awse1.mycompany.com

The above corresponds to the following DHCP option set:

domain-name = preprod.awse1.mycompany.com; domain-name-servers = 10.136.7.233, 10.136.7.249;

I believe both of these DNS servers are provided from a different AWS account. In any case, a ping test fails on both of those addresses; I'm not sure whether that means the servers don't exist or they simply don't respond to ICMP.
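One way to distinguish those two cases without relying on ICMP is to probe TCP port 53 from inside the VPC, since most resolvers listen on TCP as well as UDP. A minimal sketch (the addresses are the ones from resolv.conf above):

import socket

def probe_dns_server(ip, port=53, timeout=3):
    # A successful TCP handshake proves the server exists and is
    # reachable, even if it drops ICMP echo requests.
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

for ip in ('10.136.7.233', '10.136.7.249'):
    print(ip, 'reachable on tcp/53:', probe_dns_server(ip))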

Just now I created my own DHCP option set, identical to the above except that I changed the DNS servers to 8.8.8.8 and 8.8.4.4, and associated it with the VPC. I then revised my lambda to output the contents of /etc/resolv.conf to verify it "took" the new 8.8.8.8/8.8.4.4 DNS servers - and the lambda still got the same DNS errors! It is very strange that an explicit dig @8.8.8.8 google.com from the EC2 instance works, but a lambda associated with the same subnet gets a DNS error. I'm wondering whether the ephemeral ENIs associated with the Lambda have their own DNS server records that are not updating quickly enough to reflect the DHCP option set change.
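For reference, the resolv.conf check mentioned above can be as simple as this (a sketch, assuming the Lambda runtime exposes the standard glibc resolver path, which Linux-based runtimes do):

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_resolv_conf():
    # Dump the resolver configuration the function's ENI actually
    # picked up, so CloudWatch shows which nameservers are in effect.
    with open('/etc/resolv.conf') as f:
        logger.info(f'/etc/resolv.conf:\n{f.read()}')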

Incidentally, the VPC has "DNS resolution" and "DNS hostnames" both enabled.
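For completeness, those two VPC flags can also be verified from code; this is a sketch, and the VPC ID is a placeholder:

import boto3

ec2 = boto3.client('ec2')

# describe_vpc_attribute accepts exactly one Attribute per call.
for attr in ('enableDnsSupport', 'enableDnsHostnames'):
    resp = ec2.describe_vpc_attribute(VpcId='vpc-0123456789abcdef0', Attribute=attr)
    key = attr[0].upper() + attr[1:]  # e.g. 'EnableDnsSupport'
    print(attr, '=', resp[key]['Value'])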

Why would DNS not be working? As shown, it fails whether I use the company-provided DNS servers or Google's public ones.

Upvotes: 0

Views: 2081

Answers (1)

Brent Arias

Reputation: 30205

I've now resolved this issue. Central to the problem was that the DHCP option set for my VPC mandated DNS servers located in a different VPC. My VPC-enabled lambda was associated with a subnet whose route table had no route covering those DNS server addresses; what was needed was a route pointing them through the transit gateway to reach the other VPC. Instead, the subnet's route table sent that traffic toward the public internet, which, unsurprisingly, has no path to the other VPC.
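In boto3 terms, the missing route looks roughly like the following. This is a sketch: the route table ID and transit gateway ID are placeholders, and the destination CIDR just needs to cover the DNS server addresses (10.136.7.233 and 10.136.7.249 here):

import boto3

ec2 = boto3.client('ec2')

# Send traffic destined for the DNS servers (which live in the other
# VPC) through the transit gateway instead of the internet route.
ec2.create_route(
    RouteTableId='rtb-0123456789abcdef0',     # the subnet's route table (placeholder)
    DestinationCidrBlock='10.136.7.0/24',     # covers both DNS server addresses
    TransitGatewayId='tgw-0123456789abcdef0'  # placeholder transit gateway ID
)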

There was another subnet in my VPC whose route table did have a route pointing to the transit gateway, and that subnet was also eligible to host the lambda. Merely changing the subnet used by my lambda was therefore sufficient to make the whole thing work.
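Switching the subnet can also be done from code if needed; a sketch, with the function name, subnet ID, and security group ID as placeholders:

import boto3

lam = boto3.client('lambda')

# Re-home the function onto the subnet whose route table has the
# transit gateway route to the DNS servers.
lam.update_function_configuration(
    FunctionName='my-function',                        # placeholder
    VpcConfig={
        'SubnetIds': ['subnet-0123456789abcdef0'],     # subnet with the TGW route
        'SecurityGroupIds': ['sg-0123456789abcdef0'],  # existing security group
    },
)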

Discovering the source of this problem was hampered by other factors, such as VPC Flow Logs showing "ACCEPT OK" for DNS requests that were in fact going unanswered; ACCEPT only means the security group and NACL permitted the traffic, not that it ever reached a server. I need better mastery of interpreting flow logs (and of knowing when to disregard them).

Upvotes: 1
