Dave Potts

Reputation: 1643

Azure Service Bus Session FIFO - How Should Consumers Handle Processing Errors?

Please would you suggest how to handle consumer errors in an Azure Service Bus subscription set up to ensure FIFO processing using session IDs? (See https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sessions#first-in-first-out-fifo-pattern)

As an example, imagine a customer management system posting messages that are consumed by an accounting system. Every message uses the AccountID that owns the entities as its session ID, so that messages are received from the bus in FIFO order within the scope of each AccountID.

Imagine this message scenario:

If the consumer of the messages has the session lock on AccountID=1234, takes a PeekLock on the queue at T2 for the AddCustomer message and then suffers a transient failure of the accounting system, they are not able to add Customer 5678. What should the consumer do?

If they dead-letter the AddCustomer message, they can't go on to process the RaiseInvoice message since that will fail as the Customer 5678 doesn't exist in the accounting system.

If they abandon the AddCustomer message, are they then going to spin round a loop of AddCustomer->fail->abandon->AddCustomer until the max delivery count is reached and the message dead-letters?

What should the consumer do here to safely respond to the issue?

See https://stackoverflow.com/a/53449282/491752 for confirmation of how the bus behaves. My question is given knowledge of this problem, what should the consumer do?

Upvotes: 3

Views: 937

Answers (1)

CiaranODonnell

Reputation: 248

If it's a transient failure then you have two options. One is to catch the exception yourself and retry the processing. This is what frameworks like Azure Functions, MassTransit, and NServiceBus do: they catch your exception and then call you again with the same message. Very short-lived failure conditions might recover in that time.

The next option is to abandon the message purposely. This puts it back on the queue and it will be redelivered. This will increase the delivery count each time. The hope is that the transient failure resolves before it reaches the max delivery count. If not it will be dead lettered, and that's not ideal.

So what you could also do is tear down the whole consumer when a message processing error occurs. This would allow the session to be reallocated to another consumer, and the redelivery would go to them; hopefully they would not hit the same error.

Basically, you need to retry and/or wait in some way until the transient condition passes. You could put exponential backoffs between your retries (the new client libraries should extend your lock automatically here), or add delays before you tear down a consumer.
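As a rough illustration of the in-process retry with exponential backoff described above, here is a minimal Python sketch. It deliberately does not call the Service Bus SDK: `TransientError`, `process_with_backoff`, and the `handler` callback are all hypothetical names made up for this example, and the delay values are arbitrary. The idea is that the consumer keeps holding the same message while it retries, and only abandons it (or tears itself down) if the function returns `False`.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def process_with_backoff(handler, message, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call handler(message), retrying on TransientError with exponential backoff.

    Returns True if the handler eventually succeeded, False if every
    attempt failed. Any other exception propagates to the caller.
    """
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller decide (abandon, tear down, ...).
                return False
            # Delay doubles each time: base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return False
```

With a real receiver you would complete the message when this returns `True` and abandon it otherwise. Keep the total retry time within whatever lock renewal your client is configured for, otherwise the message lock can expire mid-retry.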

If by transient error you mean something that lasts an hour or more, you might need to monitor for errors and pause entire parts of the system (for example, disable all consumers of a queue) until you've restored whatever is broken.
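For the longer outage case, one way to sketch the "pause the consumers" idea is a small circuit-breaker style switch that each consumer checks before it receives the next message. This is only an illustration under assumptions: `ConsumerPauseSwitch`, its threshold, and its pause duration are hypothetical, it lives inside a single process, and a real system would more likely drive the pause from monitoring or by disabling the queue or subscription itself.

```python
import time


class ConsumerPauseSwitch:
    """Hypothetical helper: stop receiving for a while after repeated failures."""

    def __init__(self, failure_threshold=3, pause_seconds=300.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._pause_seconds = pause_seconds
        self._clock = clock
        self._failures = 0
        self._paused_until = float("-inf")  # not paused initially

    def record_success(self):
        # Any success resets the consecutive-failure counter.
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            # Too many consecutive failures: pause and start counting afresh.
            self._paused_until = self._clock() + self._pause_seconds
            self._failures = 0

    def should_receive(self):
        """True when the consumer may ask the bus for the next message."""
        return self._clock() >= self._paused_until
```

The receive loop would call `should_receive()` before pulling a message and `record_failure()` or `record_success()` after handling it. Because nothing is received while paused, the delivery count on the failing message stops climbing towards the dead-letter threshold.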

This failure modeling is the meat of the challenge of building reliable systems. It's also sort of the fun.

Upvotes: 2
