How to avoid race conditions in a distributed lock system using replicated redis clusters (or other replicated storage systems)?

Question

We have identical services running on two Azure regional environments along with a redis system that is replicated/synchronised across the two regions. This is enterprise active-active replication.

Entries may be placed into one redis instance and will be replicated by Azure into the other region. We have a service in each region that frequently scans for these entries and when it finds an entry it will attempt to gain a distributed lock based on this entry.

The distributed lock is another redis entry created using the StackExchange.Redis library:

StringSetAsync(key, value, expiry, When.NotExists, flags);

What we're finding is that the services in both regions attempt to grab the distributed lock at roughly the same time (a few milliseconds apart), and sometimes the replication latency means that each service obtains the lock in its own region and the replication effectively "crosses over". This results in each service doing identical work, and the system produces duplicated output (which is a big problem).
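
To make the failure mode concrete, here is a minimal in-memory simulation of what we think is happening. It is only a sketch: the `Region` class, the `replicate` helper and the key name are hypothetical, the "incoming write overwrites" conflict rule is a simplification rather than how Azure actually resolves conflicts, and expiry is left out.

```python
class Region:
    """One regional store with asynchronous outbound replication (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}    # key -> value visible in this region
        self.outbox = []  # local writes not yet delivered to the peer

    def set_if_not_exists(self, key, value):
        # Models a "set only if the key does not exist" write:
        # the check only sees this region's current view.
        if key in self.data:
            return False
        self.data[key] = value
        self.outbox.append((key, value))
        return True


def replicate(src, dst):
    """Deliver pending writes from src to dst (stands in for replication latency)."""
    for key, value in src.outbox:
        dst.data[key] = value  # simplified conflict rule: incoming write overwrites
    src.outbox.clear()


def attempt_both(replicate_first):
    """Return (east_got_lock, west_got_lock) for one lock key."""
    east, west = Region("east"), Region("west")
    east_ok = east.set_if_not_exists("lock:job-42", "east-worker")
    if replicate_first:
        replicate(east, west)  # east's write arrives before west tries
    west_ok = west.set_if_not_exists("lock:job-42", "west-worker")
    # Any remaining writes cross over after both attempts have been made.
    replicate(east, west)
    replicate(west, east)
    return east_ok, west_ok


print(attempt_both(replicate_first=True))   # (True, False)
print(attempt_both(replicate_first=False))  # (True, True)
```

When replication wins the race, the second attempt is rejected as expected. When both attempts land inside the replication window, each region's local check passes and both services believe they hold the lock.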

In our situation replication on the redis cluster is required for an SLA. It looks like the other similar questions on StackOverflow don't involve replication.

There are various solutions we are going to investigate:

Remove replication; each writer into redis will contact the instance in each region and write directly. This may avoid the latency in replication, but the SLA would suffer.
Use Azure blob storage leases; it looks like people can obtain a short lease to a named blob and this can have replication across regions.
Use Postgres or Cosmos DB; people apparently create records to act as locks and the transactional approach might avoid the race condition.
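
As a rough sketch of the record-as-lock idea, the example below uses Python's built-in `sqlite3` purely as a stand-in for Postgres. The `locks` table, the `try_acquire` function and its parameters are hypothetical. The point is that a uniqueness constraint inside a transaction lets a single authoritative store decide the winner. This presumably only helps if both regions write to the same primary; an asynchronously replicated multi-writer database would reintroduce the same race.

```python
import sqlite3
import time


def open_lock_store():
    """Create an in-memory lock table (SQLite standing in for Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE locks ("
        " name TEXT PRIMARY KEY,"      # uniqueness is what enforces mutual exclusion
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    return conn


def try_acquire(conn, name, owner, ttl_seconds, now=None):
    """Try to take the lock; return True if this owner now holds it."""
    now = time.time() if now is None else now
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # Drop an expired lock so a crashed owner cannot block forever.
            conn.execute(
                "DELETE FROM locks WHERE name = ? AND expires_at <= ?",
                (name, now),
            )
            conn.execute(
                "INSERT INTO locks (name, owner, expires_at) VALUES (?, ?, ?)",
                (name, owner, now + ttl_seconds),
            )
        return True
    except sqlite3.IntegrityError:
        # Another owner already holds an unexpired lock with this name.
        return False


store = open_lock_store()
print(try_acquire(store, "job-42", "east-worker", 30, now=100.0))  # True
print(try_acquire(store, "job-42", "west-worker", 30, now=101.0))  # False
print(try_acquire(store, "job-42", "west-worker", 30, now=131.0))  # True
```

The third call succeeds only because the first lock expired at 130.0, which mirrors the expiry we currently pass to the redis call.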

There are some more bespoke ideas we might look at, but these would be a last resort.

What we're interested in is what solution worked best for other people in this sort of situation (the requirement for replication) and whether people can suggest anything else we haven't considered yet.
