Reputation: 3291
I am seeing behaviour on a NodeJS Lambda that I don't fully understand. Firstly, I'm assuming there is at least one bug in our code but the issue that we're having trouble understanding is how that impact subsequence invocations on the same Lambda instance.
As best I can understand the issue I'm having is that after an initial failure my next invocation will also fail and I see an error that seems to indicate that nodejs has terminated. Then I see it re-initialise and start handling requests successfully again.
There are a few problems:
Let me try to show some of the details of how this looks. Firstly this is the second invocation, that really does fail, and where NodeJS appears to terminate:
START RequestId: a71447b0-19f8-456c-b163-47e0408bf3a1 Version: $LATEST
2019-10-17T00:26:12.295Z a71447b0-19f8-456c-b163-47e0408bf3a1 INFO Assuming lambda role: { ... }lambda',
END RequestId: a71447b0-19f8-456c-b163-47e0408bf3a1
REPORT RequestId: a71447b0-19f8-456c-b163-47e0408bf3a1 Duration: 228.19 ms Billed Duration: 300 ms Memory Size: 256 MBMax Memory Used: 158 MB
RequestId: a71447b0-19f8-456c-b163-47e0408bf3a1 Error: Runtime exited with error: exit status 129
Perhaps I'm misunderstanding that last line but it sure looks like NodeJS is terminating to me (with SIGHUP??!). After this the next invocation shows the XRay logging like we see on the first invocation on this instance:
2019-10-17T00:55:30.531Z undefined INFO AWS_XRAY_DAEMON_ADDRESS is set. Configured daemon address to 169.xxx.xxx.xxx.
2019-10-17T00:55:30.536Z undefined INFO AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.
2019-10-17T00:55:30.554Z undefined INFO Subsegment streaming threshold set to: 0
So it's restarted NodeJs on my lambda instance? Is that what we should expect to see?
If I look this up in X-ray etc it is shows an error but also has a 200 status:
I think it's also worth showing the previous invocation that seems to trigger this. It is also successful and it really does return the correct values etc but after completion I see this in the logs:
2019-10-17T00:55:29.169Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.326Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.328Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.329Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.329Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.330Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.368Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.388Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.408Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.428Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.467Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.468Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.469Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.567Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
2019-10-17T00:55:29.568Z 3bf23bda-e7fa-4fd7-b373-2258baff2bb8 ERROR Uncaught Exception {"errorType":"Error","errorMessage":"premature close","stack":["Error: premature close"," at Duplexify.onclose (/var/task/node_modules/end-of-stream/index.js:47:67)"," at Duplexify.emit (events.js:198:13)"," at Duplexify.EventEmitter.emit (domain.js:448:20)"," at Duplexify._destroy (/var/task/node_modules/duplexify/index.js:199:8)"," at /var/task/node_modules/duplexify/index.js:182:10"," at /var/task/node_modules/async-listener/glue.js:188:31"," at process._tickCallback (internal/process/next_tick.js:61:11)"]}
No doubt this is a bug in our code. It's not obvious to me what that might be because these libraries are not ones that our code uses. Indeed these appear to be used by AWS XRay. So is there a failure in XRay and that's causing us some failed invocations?
I'm trying to understand:
Here are some things we're going to try we tried that did not help:
Update: In the "normal" cases including the one where several errors are logged we do the same thing. We subscribe to an IoT topic, publish a message and wait up to 5 seconds for a response. None of these executions have timed out waiting for a response (we wait for up to 5 seconds and return an error if no response is received) and none of the Lambda executions have timed out. In the case above (request id a71447b0-19f8-456c-b163-47e0408bf3a1 in logs) we don't do that because as far as I can tell our code does not run - there are no logs to indicate that it does.
This does not happen frequently. I'm looking at roughly 1 in 25 executions over the last few days. This is not seeing a lot of load so we've only hit about 190 invocations (of which 8 have failed so far).
Upvotes: 1
Views: 705
Reputation: 3291
Reporting back, several months later, to describe what this really was and how to resolve it.
It was not obvious (but in hindsight perhaps it should have been) that the root of this was specifically about using MQTT.js (as part of the AWS IOT SDK) in a lambda. In particular when the lambda is not active it will sometimes get disconnected - that's reasonable to expect. What is happening here is that MQTT.js fires an event when it is disconnected and that can run at unpredictable times.
The long and short of it is that we need to handle both error
and close
events. If they are not handled then we have an unhandled exception as I've described above. Resolving this issue was very very simple - we just had to handle those events and create a new client etc.
I think we should have resolved this much more simply than we did - it wasn't obvious to us that it was related to MQTT (but once we realised it was it was relatively quick to resolve).
As a side note it's probably worth pointing out that the latest AWS SDK for IoT does not use MQTTjs underneath anyway. I don't know exactly what the motivation for that would be but I don't know how useful this question/answer is going to be in the future given that change.
Upvotes: 1