Reputation: 31
This has me really stumped, and I hope someone might have some insight. We are using Ruby and the AWS SDK to invoke an AWS Lambda function synchronously. The time it takes for the Lambda to complete is "usually" no more than 7 minutes.
The timeout defined in the AWS console for our Lambda is 600 seconds (10 minutes). The configuration we pass to the AWS Lambda client object is:
{
http_read_timeout: 600,
max_attempts: 1,
retry_limit: 0
}
We have a requirement to invoke this Lambda multiple times in threads. Each thread uses its own AWS Lambda client object (with the same configuration as above) and passes a different event payload on invocation. Our program waits for all of the invoking threads to complete.
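The fan-out described above can be sketched roughly like this (the `client_factory` is an assumption to keep the sketch testable; in the real app it would build an `Aws::Lambda::Client` with the configuration shown above, and `function_name` would be your Lambda's name):

```ruby
require 'json'

# Sketch: one thread and one Lambda client per payload, then wait for
# every synchronous invocation to finish. client_factory is a stand-in
# for something like:
#   -> { Aws::Lambda::Client.new(http_read_timeout: 600,
#                                max_attempts: 1, retry_limit: 0) }
def invoke_in_threads(client_factory, function_name, payloads)
  threads = payloads.map do |payload|
    Thread.new do
      client_factory.call.invoke(
        function_name: function_name,
        invocation_type: 'RequestResponse', # synchronous invoke
        payload: JSON.generate(payload)
      )
    end
  end
  # Thread#value joins each thread, returns its result, and re-raises
  # any exception raised inside it (e.g. a read timeout).
  threads.map(&:value)
end
```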
Locally, from our computers, this works very reliably. However, when our program runs within ECS, we get Net::Timeout TCP socket errors. The Lambda is invoked X times and the code in the Lambda succeeds, but the AWS Lambda clients in the invoking threads reach the 600-second timeout without ever receiving the response from the Lambda, and fail with the Net::Timeout TCP error.
We could change our design, but that would not be a trivial re-design and refactor: possibly six days of dev/test work, which we don't have.
But I would be very grateful to anyone who has any valuable insight into this problem. It would be good to have a dialogue and share some ideas.
Thank you kindly!
Upvotes: 1
Views: 209
Reputation: 31
We think we cracked this. If anyone runs into a similar problem, hopefully this will help them out.
The problem was client-side, specifically in the Alpine Docker container OS. We needed to:
a). Set the tcp_keepalive_time = 300
https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
b). Set the tcp_syn_retries = 8
https://man7.org/linux/man-pages/man7/tcp.7.html http://willbryant.net/overriding_the_default_linux_kernel_20_second_tcp_socket_connect_timeout
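A minimal sketch of those two settings, assuming they are applied at container start with root privileges (on ECS they can also be set per-task via the task definition's systemControls parameter):

```shell
# Send TCP keepalive probes after 300 s of idle time instead of the
# Linux default of 7200 s, so intermediate devices between the ECS
# container and the Lambda API don't silently drop the idle connection.
sysctl -w net.ipv4.tcp_keepalive_time=300

# Retransmit the initial SYN up to 8 times, lengthening the kernel's
# TCP connect timeout window.
sysctl -w net.ipv4.tcp_syn_retries=8
```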
We found that our program in the ECS container was successfully sending an API call to the AWS Lambda API. The Lambda API received it and triggered the Lambda, but the socket on the client side (our ECS container) was being closed, and our app was completely unaware of this.
We also had to monkey patch Net::HTTP, as described in Increase connect(2) timeout in RestClient / Net::HTTP on AWS Linux.
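A hypothetical sketch of that kind of patch (the linked answer's exact code may differ; this version simply retries connection attempts that hit Net::OpenTimeout, and the retry count is an assumption to tune for your environment):

```ruby
require 'net/http'

# Hypothetical patch: retry the TCP connect a few times instead of
# failing on the first Net::OpenTimeout, to ride out the kernel's
# short SYN-retransmission window.
module ConnectRetry
  CONNECT_ATTEMPTS = 3 # assumption: adjust for your environment

  def connect
    attempts = 0
    begin
      super # call Net::HTTP's original connect
    rescue Net::OpenTimeout
      attempts += 1
      retry if attempts < CONNECT_ATTEMPTS
      raise
    end
  end
end

# Prepending puts our connect ahead of Net::HTTP's in method lookup.
Net::HTTP.prepend(ConnectRetry)
```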
Upvotes: 0
Reputation: 756
Locally, from our computers this works very reliably. However, when our program is run within ECS then we get NET::Timeout TCP socket errors. The Lambda will invoke X times.
If it is still able to invoke the Lambda function from ECS but the call times out, perhaps something in the networking between them is timing it out: ECS itself, a load balancer, a proxy, something like that. It may be worth investigating.
Tough to know without all the details, but in general I would probably go for an async design, as it will scale and handle failures better. Instead of calling the Lambda function directly, your threads could push messages to an SQS queue that the Lambda function reads from.
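That redesign could look roughly like this sketch (the function and queue names are assumptions; the client is passed in so it can be an `Aws::SQS::Client` in the real app):

```ruby
require 'json'

# Sketch of the async alternative: each payload becomes an SQS message
# instead of a synchronous Lambda invocation. The Lambda function is
# then configured with the queue as its event source, so there is no
# long-lived client socket to time out.
def enqueue_jobs(sqs_client, queue_url, payloads)
  payloads.map do |payload|
    resp = sqs_client.send_message(
      queue_url: queue_url,
      message_body: JSON.generate(payload)
    )
    resp.message_id # track IDs if you need to correlate results later
  end
end
```

In the real app `sqs_client` would be `Aws::SQS::Client.new`, and completion/results would flow back through another queue, DynamoDB, S3, or Lambda destinations rather than a blocked thread.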
Upvotes: -1