Uber Cadence task list management

We are using Uber Cadence and periodically run into issues in the production environment. The setup is the following:

There are multiple domains, and the same single BE (backend) server is registered as a worker for all of them.
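
A simplified sketch of how such a setup is wired with the Cadence Java client (the domain and task-list names and the PingWorkflow type are placeholders, and the exact builder API differs between client versions):

import com.uber.cadence.client.WorkflowClient;
import com.uber.cadence.client.WorkflowClientOptions;
import com.uber.cadence.serviceclient.ClientOptions;
import com.uber.cadence.serviceclient.WorkflowServiceTChannel;
import com.uber.cadence.worker.Worker;
import com.uber.cadence.worker.WorkerFactory;
import com.uber.cadence.workflow.WorkflowMethod;

public class BackendWorkers {

    // Placeholder domain and task-list names.
    private static final String[] DOMAINS = {"domain-a", "domain-b"};
    private static final String TASK_LIST = "mytasklist";

    // Minimal placeholder workflow so the worker has something to register
    // as a decision handler.
    public interface PingWorkflow {
        @WorkflowMethod
        String ping(String input);
    }

    public static class PingWorkflowImpl implements PingWorkflow {
        @Override
        public String ping(String input) {
            return "pong: " + input;
        }
    }

    public static void main(String[] args) {
        // One connection to the Cadence frontend, shared by every domain's worker.
        WorkflowServiceTChannel service =
                new WorkflowServiceTChannel(ClientOptions.defaultInstance());

        for (String domain : DOMAINS) {
            WorkflowClient client = WorkflowClient.newInstance(
                    service,
                    WorkflowClientOptions.newBuilder().setDomain(domain).build());

            WorkerFactory factory = WorkerFactory.newInstance(client);
            Worker worker = factory.newWorker(TASK_LIST);

            // Registering workflow implementation types is what makes this
            // worker poll for decision (and query) tasks on the task list.
            worker.registerWorkflowImplementationTypes(PingWorkflowImpl.class);

            factory.start();
        }
    }
}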

What is visible in the logs is that sometimes, during a query, Cadence seems to lose stickiness to the BE service:

"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...

Meanwhile, nothing is visible in the backend. However, during this time, if I check the pollers in the Cadence web client (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers), I see that the task list is there, but it is no longer considered a decision handler. Because of this, pretty much the whole environment is dead, since nothing can make progress with decisions. The only option is to restart the backend service and let it re-register as a worker.
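
The web UI distinguishes decision pollers from activity pollers, and the same information can also be pulled programmatically to see which side disappeared. A rough sketch using the Thrift-generated DescribeTaskList call of the Java service client (type and method names follow the generated API and may differ slightly between client versions; the domain and task-list names are placeholders):

import com.uber.cadence.DescribeTaskListRequest;
import com.uber.cadence.DescribeTaskListResponse;
import com.uber.cadence.PollerInfo;
import com.uber.cadence.TaskList;
import com.uber.cadence.TaskListType;
import com.uber.cadence.serviceclient.ClientOptions;
import com.uber.cadence.serviceclient.WorkflowServiceTChannel;

public class PollerCheck {
    public static void main(String[] args) throws Exception {
        WorkflowServiceTChannel service =
                new WorkflowServiceTChannel(ClientOptions.defaultInstance());

        // Ask the matching service which pollers it currently sees for the
        // decision side and the activity side of the task list.
        for (TaskListType type : TaskListType.values()) {
            DescribeTaskListRequest request = new DescribeTaskListRequest()
                    .setDomain("mydomain")                              // placeholder
                    .setTaskList(new TaskList().setName("mytasklist"))  // placeholder
                    .setTaskListType(type);

            DescribeTaskListResponse response = service.DescribeTaskList(request);
            System.out.println(type + " pollers:");
            if (response.getPollers() != null) {
                for (PollerInfo poller : response.getPollers()) {
                    System.out.println("  " + poller.getIdentity());
                }
            }
        }
    }
}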

At this point the investigation is stuck, so some help would be appreciated.

Any help would be more than welcome! Thanks!

Answers (1)

Long Quanzheng

"Does anyone know how a worker or task list can lose its ability to be a decision handler?"

This happens when a worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So apparently, in your case, the worker stopped polling for decision tasks for some reason.

"As I understand it, when stickiness is lost, Cadence will look for another worker to replay the workflow and continue."

Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types (this is a design flaw; we are working on separating them).
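
To illustrate the point: a query issued from the client is delivered to the decision/workflow worker, not to the activity worker, so it can only succeed while something is polling decision tasks on the workflow's task list. A minimal sketch with the Java client (the OrderWorkflow interface, the domain, and the workflow ID are placeholders):

import com.uber.cadence.client.WorkflowClient;
import com.uber.cadence.client.WorkflowClientOptions;
import com.uber.cadence.serviceclient.ClientOptions;
import com.uber.cadence.serviceclient.WorkflowServiceTChannel;
import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.WorkflowMethod;

public class QueryCaller {

    // Placeholder workflow interface; the @QueryMethod is served by the
    // decision/workflow worker, not by the activity worker.
    public interface OrderWorkflow {
        @WorkflowMethod
        void process(String orderId);

        @QueryMethod
        String getStatus();
    }

    public static void main(String[] args) {
        WorkflowClient client = WorkflowClient.newInstance(
                new WorkflowServiceTChannel(ClientOptions.defaultInstance()),
                WorkflowClientOptions.newBuilder().setDomain("mydomain").build()); // placeholder domain

        // Bind to a running execution by workflow ID (placeholder) and issue
        // the query. This is the call that ends up as a query/decision task
        // and times out when no decision poller is left on the task list.
        OrderWorkflow workflow = client.newWorkflowStub(OrderWorkflow.class, "order-wf-123");
        System.out.println(workflow.getStatus());
    }
}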

From your logs:

"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"

This means that Cadence dispatched the query task to a worker, and the worker accepted it, but it did not respond back within the timeout.

It is very likely that there is a bug in your query handler logic. The bug caused the decision worker to crash (which means the Cadence Java client also has a bug, since crashing user code shouldn't crash the worker). The query task then looped over all the instances in your worker pool and eventually crashed all of your decision workers.
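
As a sketch of the shape a defensive query handler should have (placeholder names, not an implementation of your workflow): it should only return state the workflow code has already computed, never block or schedule work, and never let an exception escape.

import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.time.Duration;

public class SafeQueryWorkflow {

    public interface OrderWorkflow {
        @WorkflowMethod
        void process(String orderId);

        @QueryMethod
        String getStatus();
    }

    public static class OrderWorkflowImpl implements OrderWorkflow {

        // State the query reads; it is only ever updated by the workflow code.
        private String status = "STARTED";

        @Override
        public void process(String orderId) {
            status = "PROCESSING " + orderId;
            // Placeholder for the real logic (activities, timers, signals, ...).
            Workflow.sleep(Duration.ofMinutes(1));
            status = "DONE";
        }

        @Override
        public String getStatus() {
            // Query handlers run on the decision/workflow worker, possibly while
            // the workflow is being replayed on a new worker. They must return
            // quickly, must not block or schedule anything (no sleeps, activities,
            // or awaits), and should never let an exception escape: just return
            // a snapshot of state the workflow has already computed.
            return status == null ? "UNKNOWN" : status;
        }
    }
}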
