Handling transient exceptions when iterating over a Service Fabric ReliableDictionary

Question

Let's say I have a method that iterates over all rows in a ReliableDictionary like so:

var reliableDictionary = await StateManager.GetOrAddAsync<IReliableDictionary<TKey, TValue>>(dictionaryName);

using (var tx = StateManager.CreateTransaction())
{
    var enumerable = await reliableDictionary.CreateEnumerableAsync(tx);
    var enumerator = enumerable.GetAsyncEnumerator();
    while (await enumerator.MoveNextAsync(cancellationToken))
    {
        // Read enumerator.Current and do something with the value 
        // (not writing back to the dictionary here)
    }
}

How could I handle retrying of transient exceptions here (i.e., TimeoutException, FabricNotReadableException and FabricTransientException)?

The code documentation for the enumerator is unclear on what exceptions can be thrown on each method. Which methods can throw these transient exceptions - CreateTransaction, CreateEnumerableAsync, GetAsyncEnumerator, MoveNextAsync and enumerator.Current?

If a transient exception is thrown from one of these methods, how should I retry?

If a transient exception is thrown from MoveNextAsync or enumerator.Current, can I retry it without aborting the while loop, or should I create a whole new transaction and start enumerating from the beginning again?

yoape · Accepted Answer

The article at https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-work-with-reliable-collections describes how to work with Reliable Collections under transactions. The basic pattern it shows is:

retry:

try {
   // Create a new Transaction object for this partition
   using (ITransaction tx = base.StateManager.CreateTransaction()) {
      // AddAsync takes key's write lock; if >4 secs, TimeoutException
      await m_dic.AddAsync(tx, key, value, cancellationToken);

      await tx.CommitAsync();
   }
}
catch (TimeoutException) {
   await Task.Delay(100, cancellationToken); goto retry;
}

The sample uses a goto statement, but any retry mechanism should work.
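As an illustration of retry handling without goto, here is a minimal sketch of a bounded retry loop with a growing delay. TransientRetry, ExecuteAsync and their parameters are hypothetical names made up for this example; they are not part of the Service Fabric SDK, and the attempt count and delays are arbitrary assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Service Fabric SDK.
public static class TransientRetry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 5,
        int initialDelayMs = 100,
        CancellationToken cancellationToken = default)
    {
        var delayMs = initialDelayMs;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                // Each attempt runs the whole operation from scratch.
                return await operation(cancellationToken);
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Transient failure with attempts left: wait, then try again.
                // The delay doubles after every failed attempt.
                await Task.Delay(delayMs, cancellationToken);
                delayMs *= 2;
            }
            // On the last attempt, or for a non-transient exception,
            // the filter is false and the exception propagates to the caller.
        }
    }
}
```

The operation passed in would create the transaction, do the work and commit, so that every retry starts with a fresh transaction. Unlike the goto sample, this version gives up after maxAttempts instead of retrying forever.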

You can increase the timeout if you know your transaction will take longer (as it will in your case), but you should consider the impact that might have on your solution. The documentation at https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-reliable-collections says:

The default time-out is 4 seconds for all the Reliable Collection APIs. Most users should not override this.

And

Do not use TimeSpan.MaxValue for time-outs. Time-outs should be used to detect deadlocks.

As for the other exception types you mention (FabricNotReadableException and FabricTransientException), you could, and probably should, retry those as well. They are commonly thrown by Service Fabric when something changes in the configuration of your service(s), such as a change of primary, or if you for some reason end up talking to a secondary. In most cases they should be retryable. FabricTransientException is just a base class for a number of exceptions that can occur when communicating with Reliable Services, and it indicates an error that could go away if the operation is retried.
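Applied to the enumeration in the question, one conservative option is to treat the whole read as a single retryable unit: on any of the three exception types, dispose the transaction and start enumerating again from the beginning. The sketch below assumes that approach; whether MoveNextAsync can safely be retried in place is not something the linked documentation confirms, so this code does not attempt it. DictionaryReader, ReadAllAsync, the attempt limit and the delay are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

// Hypothetical helper for illustration; requires the Service Fabric SDK.
public static class DictionaryReader
{
    public static async Task<List<KeyValuePair<TKey, TValue>>> ReadAllAsync<TKey, TValue>(
        IReliableStateManager stateManager,
        IReliableDictionary<TKey, TValue> dictionary,
        CancellationToken cancellationToken,
        int maxAttempts = 5)
        where TKey : IComparable<TKey>, IEquatable<TKey>
    {
        for (var attempt = 1; ; attempt++)
        {
            // Fresh buffer per attempt, so rows read by a failed attempt are discarded.
            var items = new List<KeyValuePair<TKey, TValue>>();
            try
            {
                using (var tx = stateManager.CreateTransaction())
                {
                    var enumerable = await dictionary.CreateEnumerableAsync(tx);
                    using (var enumerator = enumerable.GetAsyncEnumerator())
                    {
                        while (await enumerator.MoveNextAsync(cancellationToken))
                        {
                            items.Add(enumerator.Current);
                        }
                    }
                }
                return items;
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                (ex is TimeoutException ||
                 ex is FabricNotReadableException ||
                 ex is FabricTransientException))
            {
                // Assumed back-off: wait a little longer after each failed attempt,
                // then restart with a new transaction and a new enumerator.
                await Task.Delay(100 * attempt, cancellationToken);
            }
        }
    }
}
```

Buffering the rows keeps the processing step outside the transaction, so a retry never processes the same row twice. The trade-off is memory: for a very large dictionary you would probably want to process in place and make the processing idempotent instead.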

A separate answer describes FabricNotReadableException in more detail. For instance, there are some cases where the client needs to re-resolve the service in order to end up on another replica.
