peter bray

Reputation: 1345

Intermittent DynamoDB DAX errors: NoRouteException during cluster refresh

Via CloudFormation, I have a setup including DynamoDB tables, DAX, VPC, Lambdas (living in VPC), Security Groups (allowing access to port 8111), and so on.

Everything works, except when it doesn't.

I can access DAX from my VPC'd Lambdas 99% of the time. Except occasionally they get NoRouteException errors... seemingly randomly. Here's the output from CloudWatch for a single Lambda function doing the exact same thing each time (a DAX get). Notice how it works, fails, and then works again:

/aws/lambda/BigOnion_accountGet START RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d
/aws/lambda/BigOnion_accountGet REPORT RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d  Duration: 58.24 ms  Billed Duration: 100 ms     Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee
/aws/lambda/BigOnion_accountGet REPORT RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee  Duration: 35.01 ms  Billed Duration: 100 ms     Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Version: $LATEST
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.643Z    3b63a928-f380-11e7-a116-5bb37bb69bee    caught exception during cluster refresh: { Error: NoRouteException: not able to resolve address
    at DaxClientError (/var/task/index.js:545:5)
    at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
    at _pull (/var/task/index.js:18421:20)
    at _pullFrom.then.catch (/var/task/index.js:18462:18)
  time: 1515311800643,
  code: 'NoRouteException',
  retryable: true,
  requestId: null,
  statusCode: -1,
  _tubeInvalid: false,
  waitForRecoveryBeforeRetrying: false }
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.682Z    3b63a928-f380-11e7-a116-5bb37bb69bee    Error: NoRouteException: not able to resolve address
    at DaxClientError (/var/task/index.js:545:5)
    at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
    at _pull (/var/task/index.js:18421:20)
    at _pullFrom.then.catch (/var/task/index.js:18462:18)
/aws/lambda/BigOnion_accountGet END RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a
/aws/lambda/BigOnion_accountGet REPORT RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a  Duration: 121.24 ms Billed Duration: 200 ms     Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 5b951673-f380-11e7-9818-f1effc29edd5
/aws/lambda/BigOnion_accountGet REPORT RequestId: 5b951673-f380-11e7-9818-f1effc29edd5  Duration: 39.42 ms  Billed Duration: 100 ms     Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_siteCreate START RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Version: $LATEST
/aws/lambda/BigOnion_siteCreate END RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f
/aws/lambda/BigOnion_siteCreate REPORT RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f  Duration: 3.48 ms   Billed Duration: 100 ms     Memory Size: 768 MB Max Memory Used: 48 MB

Any ideas what it could be?

It's presumably not the VPC and security access as 9/10 times access is perfectly fine. I have a wide range of CIDR IPs, so I don't think it's anything related to ENI provisioning... but what else?

The only hint I have is the initial error which states "caught exception during cluster refresh". What exactly is a "cluster refresh" and how could it lead to these failures?

Upvotes: 3

Views: 3478

Answers (1)

Jeff Hardy

Reputation: 7662

A "cluster refresh" is a background process used by the DAX Client to ensure that its knowledge of the cluster membership state somewhat matches reality, as the DAX client is responsible for routing requests to the appropriate node in the cluster.

Normally a failure on refresh is not an issue because the cluster state rarely changes (and thus the existing state can be reused), but on startup the client "blocks" to get an initial membership list. If that fails, the client can't proceed, as it doesn't know which node can handle which requests.

There can be a slight delay creating the VPC-connected ENI during a Lambda cold-start, which means the client cannot reach the cluster (hence, "No route to host") during initialization. Once the Lambda container is running it shouldn't be an issue (you might still get the exception in the logs if there's a network hiccup, but it shouldn't affect anything).

If it only happens for you during a cold-start, retrying after a slight delay should be able to work around it.
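As a rough sketch of that workaround, assuming a Node.js Lambda like the one in the question: the helper below retries an async operation after a short, growing delay when the error looks transient. The name `withRetry` and its `attempts`/`delayMs` options are made up for illustration and are not part of the DAX client; the check relies only on the `code` and `retryable` fields visible in the logged error.

```javascript
// Hypothetical helper (not part of the DAX client): retry an async operation
// when it fails with what looks like a transient routing error.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(operation, { attempts = 3, delayMs = 100 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Fields taken from the error object shown in the question's logs.
      const transient =
        Boolean(err) && (err.code === 'NoRouteException' || err.retryable === true);
      // Give up on non-transient errors or once the attempt budget is spent.
      if (!transient || attempt >= attempts) throw err;
      // Wait a little longer before each new attempt (linear backoff).
      await sleep(delayMs * attempt);
    }
  }
}
```

In a handler this might wrap the DAX call, for example `await withRetry(() => docClient.get(params).promise())`, where `docClient` and `params` stand in for whatever client and request your function already uses. The delay and attempt count are guesses and would need tuning against your actual cold-start times.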

Upvotes: 2
