Reputation: 3185
I am experiencing unexpected spikes in timeouts for an AWS Lambda function. The function is called ~7 million times per day, with a roughly constant number of calls every 5 minutes. The spikes I see are usually ~1,000 function timeouts within a half-hour period. The rest of the day there are generally no timeouts.
Here is an example day with timeouts counted on the y-axis.
The timeout setting on the function is 30 seconds. The function has an average runtime of ~50ms and an expected maximum runtime of ~5 seconds for large inputs. It uses the Python 3.6 runtime and does not use the ENI/VPC Lambda feature, so cold starts generally take only a couple of seconds.
In order to investigate the timeouts, I dug through the CloudWatch logs while the timeouts were occurring. There were no log messages or exceptions from the timed-out invocations, just the message: Task timed out after 30.03 seconds.
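For reference, the timeout messages can be pulled out of CloudWatch Logs programmatically, which also shows which log stream (i.e. which container) each one came from. A rough boto3 sketch, where the log group name and the one-hour window are placeholders:

    import boto3
    from datetime import datetime, timedelta, timezone

    logs = boto3.client("logs")
    now = datetime.now(timezone.utc)
    start_ms = int((now - timedelta(hours=1)).timestamp() * 1000)
    end_ms = int(now.timestamp() * 1000)

    # Pull the "Task timed out" messages for the last hour; each log stream
    # corresponds to one execution environment (container/microVM).
    events = logs.filter_log_events(
        logGroupName="/aws/lambda/my-function",  # placeholder name
        filterPattern='"Task timed out"',
        startTime=start_ms,
        endTime=end_ms,
    )
    for e in events["events"]:
        print(e["logStreamName"], e["message"].strip())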
Originally the function ran in a single thread, so I assumed the code might be hanging somewhere and failing to log. I attempted to add more information to the logs by modifying the function to operate like this:
    import queue
    import threading
    import time

    result_queue = queue.Queue()

    def business_logic():
        ...  # actual work elided
        result_queue.put(result)

    def lambda_handler(event, context):
        start = time.time()
        thread = threading.Thread(target=business_logic)
        thread.start()
        while result_queue.empty():  # poll until the worker finishes
            if time.time() - start > 10:
                comprehensive_logging()
        return result_queue.get()

I tested that this worked correctly by adding sleeps in business_logic and confirming that comprehensive_logging was called. However, during the actual timeouts, comprehensive_logging was never reached.
From this experiment I concluded that the timed-out Lambda invocations never reached my code. Given the spiky nature of the timeouts, I suspect that one of the Lambda microVMs running my function is getting into a bad state and is unable to process requests for a period before it is recycled.
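One way I could test this hypothesis is to log a short, module-level marker on every call, since module scope persists across warm invocations of the same execution environment. A sketch (INSTANCE_ID and business_logic are just illustrative names):

    import logging
    import uuid

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    # Created once at cold start; the same value is reused by every warm
    # invocation handled by this execution environment (container/microVM).
    INSTANCE_ID = str(uuid.uuid4())

    def lambda_handler(event, context):
        logger.info("instance=%s request=%s", INSTANCE_ID, context.aws_request_id)
        return business_logic(event)

If the timeouts all carry the same instance value (or land in the same log stream), that would support the bad-container theory.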
Question:
Other details:
Note: I am reluctant to add comprehensive logging for all calls due to the cost of CloudWatch.
Upvotes: 4
Views: 2623
Reputation: 3302
Have you considered changing the timeout setting for the function? 30 seconds might just be the default (with Chalice it is the default). But if that limit is being exceeded, you won't get the logging, since the invocation is terminated before it happens. I would set the timeout to a few minutes and then look at the logs for long-running invocations to see if that identifies the issue. You won't see timeouts on the graph, because those only occur when the invocation is terminated, but Lambda logs the execution time of every invocation, so you should be able to filter based on that.
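For example, once the timeout is raised, a CloudWatch Logs Insights query over the REPORT lines will surface the slow invocations. A sketch using boto3 (the log group name and the 10-second threshold are placeholders; @type, @duration and friends are the fields Logs Insights discovers for Lambda logs):

    import time
    import boto3

    logs = boto3.client("logs")

    # Invocations that ran longer than 10 seconds in the last 24 hours.
    query = (
        'filter @type = "REPORT" and @duration > 10000 '
        "| fields @timestamp, @requestId, @duration, @logStream "
        "| sort @duration desc "
        "| limit 50"
    )
    start = logs.start_query(
        logGroupName="/aws/lambda/my-function",  # placeholder name
        startTime=int(time.time()) - 24 * 3600,
        endTime=int(time.time()),
        queryString=query,
    )
    while True:
        result = logs.get_query_results(queryId=start["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(1)
    for row in result.get("results", []):
        print({f["field"]: f["value"] for f in row})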
Even if you up the timeout, if the function normally completes faster, you will still only be billed for what is used.
What does the function call? External HTTP APIs? AWS Services?
Also, you might want to look at the AWS X-Ray SDK for debugging; it might give you more insight into the Lambda executions that fail.
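For instance, with Active Tracing enabled on the function, minimal instrumentation with the X-Ray SDK for Python looks roughly like this (business_logic stands in for your real handler code):

    from aws_xray_sdk.core import patch_all, xray_recorder

    patch_all()  # instrument supported libraries such as boto3 and requests

    def lambda_handler(event, context):
        # Lambda opens the trace segment itself when Active Tracing is on;
        # subsegments show where time is spent inside the handler.
        with xray_recorder.in_subsegment("business_logic"):
            return business_logic(event)

The traces then show whether the time is being spent in a downstream call or in your own code.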
Upvotes: 1