SeanOB
SeanOB

Reputation: 782

Concurrent updates on DynamoDB with Conditional Expression sometime both passing

I'm having an issue where two concurrent processes are updating a DynamoDB table within 5ms of each other and both pass the conditional expression when I expect one to throw the ConditionalCheckFailedException exception. Documentation states:

DynamoDB supports mechanisms, like conditional writes, that are necessary for distributed locks.

https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/

My table schema has a single Key attribute called "Id":

AttributeDefinitions:
    -
      AttributeName: "Id"
      AttributeType: "S"
  KeySchema:
    -
      AttributeName: "Id"
      KeyType: "HASH"

My conditional expression is:

string conditional = "attribute_not_exists(StartedRefreshingAt)";

Lock Method which includes the conditional expression:

private bool Lock(Configuration config)
{
    string conditional = "attribute_not_exists(StartedRefreshingAt)";       
    Dictionary<string, AttributeValue> values = new Dictionary<string, AttributeValue>{
        {":new_refresh", new AttributeValue(ToDDBDate(DateTime.Now))}};         

    try
    {           
        _dynamoDB.UpdateItemAsync(new UpdateItemRequest
        {
            TableName = TABLE_NAME,
            Key = new Dictionary<string, AttributeValue>{{"Id", new AttributeValue(config.Id)}},
            UpdateExpression = "set StartedRefreshingAt = :new_refresh",
            ConditionExpression = conditional,
            ExpressionAttributeValues = values

        }).Wait();
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

If it returns true I consider the table locked since the StartedRefreshingAt attribute now exists.

After completing some other updates to the record on other attributes, I remove the StartedRefreshingAt attribute again, effectively releasing the lock. This is the method that uses the lock method:

private async Task<Configuration> RefreshAsync(Configuration config)
{
    // concurrent executions may enter here with the same value
    // for config
    if (config.AccessTokenExpired()) // Validates the age of the config.AccessToken
    {
        _logger.LogInformation($"Refreshing expired access token for {config.Id}");

        if (Lock(config))
        {
            // The below code should not be executed concurrently
            
            var authPayload = new List<KeyValuePair<string, string>>();
            authPayload.Add(new KeyValuePair<string, string>("grant_type", "refresh_token"));
            authPayload.Add(new KeyValuePair<string, string>("refresh_token", config.RefreshToken));
            
            // 3rd party REST API call
            // IF execution 1 completed and saved first, the below API
            // call should fail with an "invlaid grant" response
            // since it will be using a stale refresh token.
            JObject authResponse = await GetAuthTokensAsync(authPayload); 

            // Bad practice, but print the auth tokens to logs here 
            // in case we need to recover a RefreshToken.
            _logger.LogInformation($"Got auth token response for {config.Id}:" +
                $" {authResponse.ToString(Formatting.None)}");
            
            config.AccessToken = authResponse["access_token"].ToString();
            config.AccessTokenCreated = DateTime.Now;
            // RefreshToken is updated whenever the API call succeeds.
            config.RefreshToken = authResponse["refresh_token"].ToString();
            config.RefreshTokenCreated = config.AccessTokenCreated;
            config.StartedRefreshingAt = null; // lock attribute
            
            await Save(config); // saves to DynamoDB, releasing the lock.
        }
        else
        {
            _logger.LogInformation($"Someone else is refreshing {config.Id} at same time, so ignoring and returning current value");
            // this will return a potentially stale config object, which client code is expected to handle
        }
    }

    return config;
}

Most of the time, when two concurrent executions run, this code successfully returns false on one of the two executions, on average around 20 times a day. That seems to indicate that the expression is evaluating correctly. But roughly once every 2 weeks, concurrent executions are returning true.

EDIT: After reviewing the answer and providing more code context, it's most likely that the issue is not DynamoDB, and execution 2's conditional write succeeds AFTER execution 1 has released the lock. It could be a consistency issue with the 3rd party OAuth server.

Now I think the issue is that execution 2 completes the OAuth API call successfully using the same refresh token as execution 1. This shouldn't be possible.

The end result is that I have a RefreshToken saved in DynamoDB that perpetually returns "invalid grant", meaning it is stale. If I look through the logs for when the log line "Got auth token response for" was written by two executions really close together, I can update the DynamoDB table manually with the RefreshToken that was logged first, it recovers.

The scenario is pretty classic for a race condition:

  1. execution 1 sets the StartedRefreshingAt attribute and returns true, goes on to complete additional work
  2. execution 2 sets the StartedRefreshingAt attribute and returns true, goes on to complete additional work
  3. execution 2 writes results of additional work back to the table
  4. execution 1 writes results of additional work back to the table, overwriting the more recent work done by execution 2.

It's also possible that sometimes steps 3 and 4 are executing in reverse, but that doesn't present a consistency problem for me since the most recent work done is the desired final result.

Upvotes: 4

Views: 2155

Answers (1)

Nadav Har&#39;El
Nadav Har&#39;El

Reputation: 13731

The race you are suggesting is very surprising, because it is exactly what DynamoDB claims its conditional updates avoids. So either Amazon have a serious bug in their implementation (which would be surprising, but not impossible), or the race is actually different than what you described in your question.

In your timeline you didn't say how your code resets "StartedRefreshingAt" to nothing. Does the same UpdateTable operation which writes the results of the work back to the table also deletes the StartedRefreshingAt attribute? Because if it's a separate write, it's theoretically possible (even if not common) for the two writes to be reordered. If StartedRefreshingAt is deleted first, at that moment the second process can start its own work - before the first process's results were written - so the problem you described can happen.

Another thing you didn't say is how your processing reads the work from the item. If you accidentally used eventual consistency for the read, instead of strong consistency, it is possible that execution 2 actually did start after execution 1 was finished, but when it read the work it needs to do - it read again the old value and not what execution 1 wrote - so execution 2 ended up repeating 1's work instead of doing new work.

I don't know if either of these guesses makes sense because I don't know the details of your application, but I think the possibility that DynamoDB consistency simply doesn't work as promised is the last guess I would make.

Upvotes: 4

Related Questions