Reputation: 255
Our team recently had an incident caused by our stateless services being restarted for Azure's automatic runtime updates. One of the services was in the middle of processing a task when it was forcefully shut down. These tasks can take as long as 4 hours.
Either through code or configuration, is there a way to let Azure know that our services are busy and can't be shut down at this time?
In other words, how can we let Azure know when our services are ready for the Service Fabric runtime upgrade?
Upvotes: 0
Views: 496
Reputation: 303
Durability tier privilege allows Service Fabric to pause any VM-level infrastructure request (such as a VM reboot, VM reimage, or VM migration).
Bronze - No privileges. This is the default.
Silver - The infrastructure jobs can be paused for a duration of 10 minutes per UD.
Gold - The infrastructure jobs can be paused for a duration of 2 hours per UD. Gold durability can be enabled only on full node VM SKUs like D15_V2, G5, etc.
Upvotes: 0
Reputation: 29820
Well, first of all, why don't you switch to manual upgrade mode?
Second, with long-running jobs you still have to take into account that nodes can fail and that service instances can be moved or change role. Any of these events will terminate your long-running job if you don't handle shutdown notifications well.
Service Fabric signals the service that it is about to be shut down by canceling the CancellationToken that is passed to RunAsync. The following is taken from the docs:
Service Fabric changes the Primary of a stateful service for a variety of reasons. The most common are cluster rebalancing and application upgrade. During these operations (as well as during normal service shutdown, like you'd see if the service was deleted), it is important that the service respect the CancellationToken.
Services that do not handle cancellation cleanly can experience several issues. These operations are slow because Service Fabric waits for the services to stop gracefully.
The documentation for the RunAsync method says the same thing more briefly:
Make sure cancellationToken passed to RunAsync(CancellationToken) is honored and once it has been signaled, RunAsync(CancellationToken) exits gracefully as soon as possible.
In your case you should act on the CancellationToken being canceled. You should store the state of your current job somehow so you can resume it the next time RunAsync is called.
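The checkpoint-and-resume idea can be sketched roughly as follows. This is a language-neutral illustration in Python, not Service Fabric code: `run_job`, `process`, and the JSON checkpoint file are hypothetical names made up for the example, and `stop_event` stands in for the CancellationToken passed to RunAsync. A local file is assumed only to keep the sketch self-contained; a real stateless service would need durable storage outside the node, since the instance may come back on a different machine.

```python
import json
import os
import threading
from pathlib import Path


def run_job(items, checkpoint_path, stop_event, process):
    """Process items in order, resuming from the last saved checkpoint.

    Returns True when every item has been processed, False when the
    stop signal arrived first (progress so far is kept on disk).
    """
    path = Path(checkpoint_path)
    # Resume where the previous run stopped, or start from the beginning.
    start = json.loads(path.read_text())["next_index"] if path.exists() else 0

    for index in range(start, len(items)):
        # Check the stop signal between units of work and exit promptly.
        if stop_event.is_set():
            return False
        process(items[index])
        # Write to a temp file and swap it in, so a crash in the middle
        # of the write never leaves a half-written checkpoint behind.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps({"next_index": index + 1}))
        os.replace(tmp, path)

    # Job finished: remove the checkpoint so the next job starts fresh.
    path.unlink(missing_ok=True)
    return True
```

The point of the design is that the 4-hour task is split into units small enough that the service can stop between two of them within seconds, and the next call picks up at the first unprocessed unit.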
If it really is a long-running job that cannot be interrupted and resumed by any means, you should consider doing this work outside a Reliable Service, in a WebJob or something similar. Or accept that some work might be lost.
In other words, you cannot tell Service Fabric to wait before shutting down your service. Doing so would interfere with the balancing and reliability of the cluster as well.
Upvotes: 1