Reputation: 1345
Via CloudFormation, I have a setup including DynamoDB tables, DAX, a VPC, Lambdas (living in the VPC), Security Groups (allowing access to port 8111), and so on.
Everything works, except when it doesn't.
I can access DAX from my VPC'd Lambdas 99% of the time. Except occasionally they get NoRouteException errors... seemingly randomly. Here's the output from CloudWatch for a single Lambda function doing the exact same thing each time (a DAX get). Notice how it works, fails, and then works again:
/aws/lambda/BigOnion_accountGet START RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d
/aws/lambda/BigOnion_accountGet REPORT RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Duration: 58.24 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee
/aws/lambda/BigOnion_accountGet REPORT RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Duration: 35.01 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Version: $LATEST
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.643Z 3b63a928-f380-11e7-a116-5bb37bb69bee caught exception during cluster refresh: { Error: NoRouteException: not able to resolve address
at DaxClientError (/var/task/index.js:545:5)
at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
at _pull (/var/task/index.js:18421:20)
at _pullFrom.then.catch (/var/task/index.js:18462:18)
time: 1515311800643,
code: 'NoRouteException',
retryable: true,
requestId: null,
statusCode: -1,
_tubeInvalid: false,
waitForRecoveryBeforeRetrying: false }
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.682Z 3b63a928-f380-11e7-a116-5bb37bb69bee Error: NoRouteException: not able to resolve address
at DaxClientError (/var/task/index.js:545:5)
at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
at _pull (/var/task/index.js:18421:20)
at _pullFrom.then.catch (/var/task/index.js:18462:18)
/aws/lambda/BigOnion_accountGet END RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a
/aws/lambda/BigOnion_accountGet REPORT RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Duration: 121.24 ms Billed Duration: 200 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 5b951673-f380-11e7-9818-f1effc29edd5
/aws/lambda/BigOnion_accountGet REPORT RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Duration: 39.42 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_siteCreate START RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Version: $LATEST
/aws/lambda/BigOnion_siteCreate END RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f
/aws/lambda/BigOnion_siteCreate REPORT RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Duration: 3.48 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
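For reference, the function is doing nothing more exotic than the following (a minimal sketch of the setup; the table name, key, and environment variable names are placeholders, not my real ones):

    'use strict';
    const AmazonDaxClient = require('amazon-dax-client');
    const AWS = require('aws-sdk');

    // Client is created once per container, at module load (i.e. during cold start).
    const dax = new AmazonDaxClient({
      endpoints: [process.env.DAX_ENDPOINT], // cluster discovery endpoint, port 8111
      region: process.env.AWS_REGION
    });
    const doc = new AWS.DynamoDB.DocumentClient({ service: dax });

    exports.handler = (event, context, callback) => {
      // A plain DocumentClient get, routed through DAX.
      doc.get({ TableName: 'Accounts', Key: { id: event.id } }, (err, data) => {
        if (err) return callback(err);
        callback(null, data.Item);
      });
    };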
Any ideas what it could be?
It's presumably not the VPC or security group access, since access is perfectly fine 9 times out of 10. I have a wide range of CIDR IPs, so I don't think it's anything related to ENI provisioning... but what else?
The only hint I have is the initial error, which states "caught exception during cluster refresh". What exactly is a "cluster refresh", and how could it lead to these failures?
Upvotes: 3
Views: 3478
Reputation: 7662
A "cluster refresh" is a background process used by the DAX Client to ensure that its knowledge of the cluster membership state somewhat matches reality, as the DAX client is responsible for routing requests to the appropriate node in the cluster.
Normally a failure on refresh is not an issue, because the cluster state rarely changes (and thus the existing state can be reused). On startup, however, the client blocks to fetch an initial membership list; if that fails, the client can't proceed, as it doesn't know which node can handle which requests.
There can be a slight delay creating the VPC-connected ENI during a Lambda cold start, which means the client cannot reach the cluster (hence "no route to host") during initialization. Once the Lambda container is running it shouldn't be an issue (you might still see the exception in the logs if there's a network hiccup, but it shouldn't affect anything).
If it only happens for you during a cold start, retrying after a slight delay should work around it; a sketch of that is below.
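For example, something along these lines (a rough sketch, not tested; the endpoint, table, key, and retry parameters are all placeholders, and the err.code check comes straight from the error object in your log):

    'use strict';
    const AmazonDaxClient = require('amazon-dax-client');
    const AWS = require('aws-sdk');

    const dax = new AmazonDaxClient({
      endpoints: [process.env.DAX_ENDPOINT],
      region: process.env.AWS_REGION
    });
    const doc = new AWS.DynamoDB.DocumentClient({ service: dax });

    // Retry NoRouteException a few times with a short delay between attempts;
    // the error is marked retryable, and once the ENI is up a retry should succeed.
    function getWithRetry(params, attempts, delayMs, callback) {
      doc.get(params, (err, data) => {
        if (err && err.code === 'NoRouteException' && attempts > 1) {
          return setTimeout(
            () => getWithRetry(params, attempts - 1, delayMs, callback),
            delayMs
          );
        }
        callback(err, data);
      });
    }

    exports.handler = (event, context, callback) => {
      // Table name and key are placeholders.
      getWithRetry({ TableName: 'Accounts', Key: { id: event.id } }, 3, 500, callback);
    };

Since the exception only shows up on the first request from a fresh container, this keeps the retry logic out of the warm path entirely.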
Upvotes: 2