Reputation: 31
I have an AWS Lambda function that is invoked by another service. The function calls external API endpoints and sometimes gets a network timeout or no response from them. What is the best way to implement a retry mechanism in AWS Lambda to handle failures when the external API is not responding or returns server-side errors? Also, what fallback strategy should I use?
I followed the article below, which suggests using Step Functions to implement the retry-with-backoff pattern. Is there any sample code for this, and what cost considerations should I keep in mind when using these services?
I followed these articles for the solution approach:
Upvotes: 1
Views: 1741
Reputation: 2233
Your Lambda essentially has three stages:
Stage 1 (pre-API execution): Everything that runs before the API call is made.
Stage 2 (API execution): The function waits for the response from the API.
Stage 3 (post-API execution): The function continues with the data from Stage 2, or halts if an error occurred.
Things to consider:
Lambda has a hard timeout limit of 15 minutes, so any retries and delays inside a single invocation must fit within it.
Solution 1: Wrap the external API call with retry logic
Assumption: Your Lambda can complete all retries, including the waits between them, within its timeout before moving on to the next stage.
Example for Node.js:
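const axios = require('axios');

// Calls `url` up to `maxAttempts` times, waiting a fixed delay between attempts.
async function retryApiCall(url, maxAttempts = 3, delayBetweenAttempts = 1000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // axios rejects on non-2xx responses by default, so those end up in the catch block.
      const response = await axios.get(url);
      if (response.status === 200) {
        return response.data;
      }
    } catch (error) {
      console.error(`Error on attempt ${attempt}:`, error.message);
    }
    // Skip the delay once the final attempt has failed.
    if (attempt < maxAttempts) {
      await new Promise(resolve => setTimeout(resolve, delayBetweenAttempts));
    }
  }
  throw new Error(`Failed after ${maxAttempts} attempts`);
}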
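The loop above waits a fixed interval between attempts. As a rough sketch of the backoff variant mentioned in the question, the helper below doubles the delay ceiling after each failure and picks a random wait below it (jitter). The names `retryWithBackoff`, `baseDelayMs`, `maxDelayMs`, and the injectable `sleep` are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: retries any async function with exponential backoff and jitter.
// All option names here are made up for illustration.
async function retryWithBackoff(fn, options = {}) {
  const {
    maxAttempts = 3,
    baseDelayMs = 500,
    maxDelayMs = 10000,
    // Injectable so the waits can be stubbed out in tests.
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) break; // no wait after the final failure
      // Ceiling doubles each attempt (500, 1000, 2000, ...) up to maxDelayMs.
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * ceiling);
    }
  }
  throw new Error(`Failed after ${maxAttempts} attempts: ${lastError.message}`);
}
```

With axios this might be called as `await retryWithBackoff(() => axios.get(url, { timeout: 5000 }))`. The per-request timeout plus all the waits still has to fit inside the Lambda timeout.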
Pros:
Cons:
Solution 2: Step Functions path
With this approach, your Lambda should not wait for all the stages to complete. It completes the first stage, starts the Step Functions execution, and exits.
Phase 1: Instead of calling the API, your Lambda starts a Step Functions execution and ends. (It only does Stage 1 as described above.)
Phase 2: Step Functions handles the retries and the delay between them for you. Because each retry is a new Lambda invocation, every attempt gets the full 15-minute limit, and the delays are handled outside Lambda. (Stage 2)
Phase 3: A new Lambda. Once Phase 2 completes, its output is passed to a new Lambda, and you continue from there. (Stage 3)
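The phases above could be sketched as an Amazon States Language definition, built here as a plain JavaScript object. `Retry` and `Catch` with `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` are standard ASL fields; the function name, state names, ARN parameters, and the specific interval and attempt values are placeholders chosen for illustration.

```javascript
// Hypothetical builder: the ARNs passed in and the state names are placeholders.
function buildRetryStateMachine({ callApiLambdaArn, postProcessLambdaArn, fallbackLambdaArn }) {
  return {
    Comment: 'Call an external API with retry and backoff, then continue or fall back',
    StartAt: 'CallExternalApi',
    States: {
      // Phase 2: the Lambda that performs the API call (Stage 2).
      CallExternalApi: {
        Type: 'Task',
        Resource: callApiLambdaArn,
        Retry: [
          {
            ErrorEquals: ['States.TaskFailed', 'States.Timeout'],
            IntervalSeconds: 2, // wait before the first retry
            MaxAttempts: 4,     // number of retries after the initial attempt
            BackoffRate: 2.0,   // waits grow as 2s, 4s, 8s, 16s
          },
        ],
        // Once retries are exhausted, keep the error details and go to the fallback state.
        Catch: [{ ErrorEquals: ['States.ALL'], ResultPath: '$.error', Next: 'HandleFailure' }],
        Next: 'PostProcess',
      },
      // Phase 3: a separate Lambda continues with the API output (Stage 3).
      PostProcess: { Type: 'Task', Resource: postProcessLambdaArn, End: true },
      HandleFailure: { Type: 'Task', Resource: fallbackLambdaArn, End: true },
    },
  };
}
```

The object would be serialized with `JSON.stringify` and supplied as the state machine definition. On cost: Standard workflows are billed per state transition, and retries count toward that, so a generous `MaxAttempts` on a high-volume workflow adds up.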
Pros:
Cons:
Fallback
Neither solution handles a fallback on its own.
What does a fallback mean for you? If there is no guarantee that the external API will work on a retry, there is no point in going the complex route. That decision is yours.
Fallback solution: Add a dead-letter queue (DLQ) to capture the failed messages. If all attempts fail, it is better to handle those messages separately.
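As a minimal sketch of that fallback, the wrapper below tries the call a fixed number of times and, if every attempt fails, hands the original payload to a DLQ sender instead of throwing. `callWithFallback`, `callApi`, and `sendToDlq` are hypothetical names; both functions are supplied by the caller, and the shape of the dead-letter record is an assumption.

```javascript
// Hypothetical wrapper: `callApi` and `sendToDlq` are injected by the caller.
async function callWithFallback(payload, { callApi, sendToDlq, maxAttempts = 3 }) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { status: 'ok', data: await callApi(payload) };
    } catch (error) {
      lastError = error; // remember the most recent failure
    }
  }
  // All attempts failed: park the message so it can be inspected or replayed later.
  await sendToDlq({
    payload,
    error: lastError.message,
    attempts: maxAttempts,
    failedAt: new Date().toISOString(),
  });
  return { status: 'dead-lettered' };
}
```

In a real function, `sendToDlq` might wrap an SQS `SendMessage` call against the DLQ's URL. Keeping it injectable makes the fallback path easy to exercise without AWS.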
Solution 3: Send messages from your Kafka topic to SQS instead of invoking Lambda directly
SQS has retry policies too.
Retry strategy: SQS has no dynamic delay, so retries are static. You can use the visibility timeout as the delay between attempts.
Implementation: Say your Lambda has a maximum timeout of 15 minutes and you set the visibility timeout to 15 minutes as well. If the Lambda fails, the message is retried only once 15 minutes have passed since the Lambda started processing it, even if the Lambda terminated earlier.
DLQ: Once all retries are exhausted, the message can be moved to a DLQ. Add an alert so you know the message failed to process and can act on it later.
SQS to Step Functions? Yes, you can combine SQS with Step Functions too.
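The visibility timeout and DLQ settings from Solution 3 could be expressed as SQS queue attributes, roughly as follows. `VisibilityTimeout` and `RedrivePolicy` (with `deadLetterTargetArn` and `maxReceiveCount`) are real SQS attribute names; the helper name, its parameters, and the default of three receives are assumptions for illustration.

```javascript
// Hypothetical helper that builds SQS queue attributes.
// The SQS API expects attribute values as strings, hence String() and JSON.stringify().
function buildQueueAttributes({ dlqArn, lambdaTimeoutSeconds = 900, maxReceiveCount = 3 }) {
  return {
    // A failed message becomes visible again (and is retried) after this many seconds.
    VisibilityTimeout: String(lambdaTimeoutSeconds),
    // After maxReceiveCount failed receives, SQS moves the message to the DLQ.
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount }),
  };
}
```

The result would be passed as `Attributes` to the SQS `CreateQueue` or `SetQueueAttributes` call. The 900-second default mirrors the 15-minute example above; AWS's own guidance for Lambda event sources is a visibility timeout of at least six times the function timeout, so a larger value may be safer.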
Upvotes: 1