Reputation: 1115
I am running a Rust app with Tokio in prod. In the last version i had a bug, and some requests caused my code to go into an infinite loop.
What happened is while the task that got into the loop was stuck, all the other task continue to work well and processing requests, that happened until the number of stalling tasks was high enough to cause my program to be unresponsive.
My problem is took a lot of time to our monitoring systems to identify that something go wrong. For example, the task that answer to Kubernetes' health check works well and I wasn't able to identify that I have stalled tasks in my system.
So my question is if there's a way to identify and alert in such cases?
If i could find way to define timeout on task, and if it's not return to the scheduler after X seconds/millis to mark the task as stalled, that will be a good enough solution for me.
Upvotes: 4
Views: 1678
Reputation: 11728
Tokio Console is a monitoring solution built by the Tokio team. It can be used to monitor for stalled tasks among other things.
In spirit, it is like the top command but specifically for Tokio.
https://github.com/tokio-rs/console
Upvotes: 2
Reputation: 42472
Using tracing
might be an option here: following issue 2655 every tokio task should have a span. Alongside tracing-futures this means you should get a tracing event every time a task is entered or suspended (see this example), by adding the relevant data (e.g. task id / request id / ...) you should then be able to feed this information to an analysis tool in order to know:
I think that's about the extent of it: as noted by issue 2510, tokio doesn't yet use the tracing information it generates and so provide no "built-in" introspection facilities.
Upvotes: 4