user2894233

Reputation: 31

Retry and fallback solution in AWS Lambda when calling external API

I have an AWS Lambda function that is invoked by another service. The function calls external API endpoints and sometimes gets a network timeout or no response at all. What is the best way to implement a retry mechanism in AWS Lambda to handle failures when the external API is not responding or returns server-side errors? Also, what fallback strategy should I use?

I followed the article below, which suggests using Step Functions to implement the retry-with-backoff pattern. Is there any sample code for this, and what cost considerations should I keep in mind when using these services?

Followed these articles for the solution approach

Upvotes: 1

Views: 1741

Answers (1)

Bhavesh Parvatkar

Reputation: 2233

So mainly your lambda has 3 stages:

Stage 1 (Pre-API execution): This stage runs before the API call is made.

Stage 2 (API execution): You wait for the response from the API.

Stage 3 (Post-API execution): You continue your Lambda execution with the data from Stage 2, or halt if an error occurred.

Things to consider:

Lambda has a hard timeout limit of 15 minutes.

Solution 1: Wrap the external API call with retry logic

Assumption: Your Lambda can complete all retries within its timeout and still have time left to process the next stage.

Example for Node.js:

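const axios = require('axios');

// GETs `url`, retrying on network errors, timeouts, and non-2xx responses.
async function retryApiCall(url, maxAttempts = 3, delayBetweenAttempts = 1000) {
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // The 5 s request timeout is an illustrative value; without one, a hung
      // connection can consume the whole Lambda timeout.
      const response = await axios.get(url, { timeout: 5000 });
      // axios rejects non-2xx statuses by default, so reaching here means success.
      return response.data;
    } catch (error) {
      lastError = error;
      console.error(`Error on attempt ${attempt}:`, error.message);
    }

    // Wait before the next attempt, but not after the final one.
    if (attempt < maxAttempts) {
      await new Promise(resolve => setTimeout(resolve, delayBetweenAttempts));
    }
  }

  throw new Error(`Failed after ${maxAttempts} attempts`, { cause: lastError });
}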

Pros:

  1. Easy to debug: All the retries stay within one Lambda invocation, so there is no need to go through separate logs.
  2. No Lambda breakdown needed: Everything is handled within the Lambda itself, so the code does not have to be split up the way it does for a Step Function.

Cons:

  1. In the worst case, the Lambda has to wait until all retries have finished, and you are billed for the entire duration.
  2. Won't work if the retries need more than 15 minutes in total.
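Since the question asks about retry with backoff, here is a minimal sketch of how the fixed delay above could be swapped for exponential backoff with jitter. The function names, the 500 ms base, and the 30 s cap are assumptions for illustration, not values from the question; `operation` stands for whatever async call you want to retry.

```javascript
// Sketch: exponential backoff with "full jitter". Names and defaults are illustrative.
const defaultSleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Upper bound of the wait after failed attempt number `attempt` (1-based).
function backoffCeiling(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Random delay in [0, ceiling) so concurrent callers do not retry in lockstep.
function backoffDelay(attempt, baseMs = 500, capMs = 30000, random = Math.random) {
  return Math.floor(random() * backoffCeiling(attempt, baseMs, capMs));
}

// Runs `operation` up to `maxAttempts` times, sleeping between failed attempts.
async function retryWithBackoff(operation, options = {}) {
  const { maxAttempts = 3, baseMs = 500, capMs = 30000, sleep = defaultSleep } = options;
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // No sleep after the final attempt; just fall through and rethrow.
      if (attempt < maxAttempts) {
        await sleep(backoffDelay(attempt, baseMs, capMs));
      }
    }
  }
  throw lastError;
}
```

Usage would look something like `await retryWithBackoff(() => axios.get(url, { timeout: 5000 }))`. Keep in mind that the sum of all delays plus request timeouts still has to fit inside the Lambda timeout.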

Solution 2: Step function path

When you consider this path, your Lambda shouldn't wait for all stages to complete. It completes its first stage, starts the Step Function execution, and ends.

Phase 1: Instead of calling the API, your Lambda starts a Step Function execution and ends. (It only does Stage 1 as mentioned above.)

Phase 2: The Step Function handles the retries and the delay mechanism for you. You can take advantage of the full 15-minute timeout for each retry, as each retry is a new Lambda invocation. The delays are also handled separately. (Stage 2)

Phase 3: A new Lambda. Once Phase 2 is completed, its output is sent to a new Lambda, and you continue (Stage 3) from there.
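To make the phases concrete, here is a rough sketch of what the state machine definition (Amazon States Language) could look like, built as a plain JavaScript object so it can be serialized with `JSON.stringify`. The state names, the three ARNs, and the retry numbers are placeholders I made up; `Retry`, `Catch`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` are standard ASL fields.

```javascript
// Sketch of a Step Functions definition covering Phase 2 and Phase 3.
// State names and ARNs are placeholders, not from the original question.
function buildDefinition(callApiLambdaArn, postProcessLambdaArn, fallbackLambdaArn) {
  return {
    Comment: 'Call an external API with retries, then continue or fall back',
    StartAt: 'CallExternalApi',
    States: {
      // Phase 2: the Lambda that performs the API call (Stage 2).
      CallExternalApi: {
        Type: 'Task',
        Resource: callApiLambdaArn,
        Retry: [
          {
            ErrorEquals: ['States.TaskFailed'],
            IntervalSeconds: 5, // wait before the first retry
            MaxAttempts: 3,     // retries, not counting the initial attempt
            BackoffRate: 2.0,   // waits of 5 s, 10 s, 20 s
          },
        ],
        // After the retries are exhausted, route to the fallback state.
        Catch: [
          { ErrorEquals: ['States.ALL'], ResultPath: '$.error', Next: 'HandleFailure' },
        ],
        Next: 'PostProcess',
      },
      // Phase 3: the new Lambda that continues with Stage 3.
      PostProcess: { Type: 'Task', Resource: postProcessLambdaArn, End: true },
      HandleFailure: { Type: 'Task', Resource: fallbackLambdaArn, End: true },
    },
  };
}
```

The Phase 1 Lambda would then start an execution of this state machine (for example with the AWS SDK's `StartExecution` call) and return immediately.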

Pros:

  1. You can take advantage of the full Lambda timeout of 15 minutes for each attempt; the delay between retries doesn't count against the Lambda timeout.

Cons:

  1. Harder to debug: As your process is broken down into different stages, imagine how you will debug it. First, you look at the Lambda in Phase 1. Then you find the Step Function execution in Phase 2. Then you look into the logs of each retry (they are separate for each execution). Then you check the new Lambda's logs.
  2. Added billing: Step Function billing is based on state transitions. Check out the examples here
  3. Additional data: If there is data that you want to pass on from Phase 1, you will have to think of a strategy for passing it across all phases of the Step Function. Also, remember that Step Functions has a hard limit of 256 KB on the data passed between states. If your data exceeds that, you need something like an S3 file or another alternative. If not, you can pass it as input to all your Lambda functions in Phase 2 and Phase 3.
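For the 256 KB point, one possible approach is to measure the serialized input in Phase 1 and pass a pointer instead of the data when it is too large. The sketch below only makes that decision; the `data` and `dataLocation` field names are my own, and uploading the payload to S3 beforehand is assumed to happen elsewhere.

```javascript
// Sketch: decide whether data can travel inline through the state machine.
// 256 KiB is the Step Functions payload limit; field names are illustrative.
const MAX_STATE_PAYLOAD_BYTES = 256 * 1024;

// Size of a value once serialized to JSON and encoded as UTF-8.
function serializedSizeBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), 'utf8');
}

// Returns the execution input: the data itself when it fits,
// otherwise a reference to where the caller has already stored it.
function toExecutionInput(data, s3Location) {
  const inline = { data };
  if (serializedSizeBytes(inline) <= MAX_STATE_PAYLOAD_BYTES) {
    return inline;
  }
  return { dataLocation: s3Location };
}
```

The Lambdas in Phase 2 and Phase 3 would then check which of the two fields is present and fetch the object from S3 when they get a reference.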

Fallback

Neither of the two solutions handles a fallback.

What does fallback mean for you? If there is no guarantee that the external API will work on retries, there is no point in going the complex route. That decision is left to you.

Fallback solution: Add a DLQ to handle these failed messages. If all the attempts fail, it's better to handle them separately.
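As a rough illustration of that idea, the wrapper below runs the primary call, and on failure records the request for later handling and returns a degraded result. All three arguments are hypothetical async functions you would supply: `primary` could be the retry wrapper from Solution 1, `sendToDeadLetter` could push to an SQS queue, and `fallback` could return a cached or default value.

```javascript
// Sketch only: `primary`, `fallback`, and `sendToDeadLetter` are hypothetical
// async functions supplied by the caller.
async function callWithFallback(primary, fallback, sendToDeadLetter) {
  try {
    return { source: 'primary', data: await primary() };
  } catch (error) {
    // Record the failure so the request can be inspected or replayed later.
    await sendToDeadLetter({
      reason: error.message,
      failedAt: new Date().toISOString(),
    });
    // Serve a degraded result instead of failing the whole invocation.
    return { source: 'fallback', data: await fallback(error) };
  }
}
```

Whether returning a fallback value is acceptable at all depends on your use case; if it is not, drop the `fallback` call and rethrow after sending to the DLQ.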

Solution 3: Add SQS to your Kafka topic instead of hitting Lambda

SQS has retry policies too.

Retry strategy: SQS doesn't have a dynamic delay; the retry interval is static. It has a visibility timeout, which you can leverage for the delay.

Implementation: Let's say your Lambda has a max timeout of 15 minutes and you set a visibility timeout of 15 minutes too. If the Lambda fails, the message is retried only after 15 minutes have passed (counted from when the Lambda started executing), even if the Lambda terminated earlier.

DLQ: Once all the retries are exhausted, the message can be moved to a DLQ. Add an alert that this message failed to process so that you can take action later.
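A minimal sketch of an SQS-triggered handler for this setup is shown below. It assumes the event source mapping has `ReportBatchItemFailures` enabled, so that only the failed messages go back to the queue, and that the queue's redrive policy (`maxReceiveCount`) decides when a message finally moves to the DLQ. `processMessage` is a hypothetical function that performs the external API call.

```javascript
// Sketch: build an SQS handler that reports per-message failures.
// `processMessage` is a hypothetical async function doing the real work.
function createHandler(processMessage) {
  return async function handler(event) {
    const batchItemFailures = [];

    for (const record of event.Records) {
      try {
        await processMessage(JSON.parse(record.body));
      } catch (error) {
        console.error(`Message ${record.messageId} failed:`, error.message);
        // Failed messages become visible again after the visibility timeout.
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }

    // An empty list tells Lambda the whole batch succeeded.
    return { batchItemFailures };
  };
}
```

You would export it with something like `exports.handler = createHandler(callExternalApi)`, where `callExternalApi` is your own function.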

SQS to Step Function? Yes, you can connect SQS to a Step Function too.

Upvotes: 1
