Phil Mar
Phil Mar

Reputation: 157

Azure Functions, Event Grid and Event Redelivery

I have an Azure function app running under a Consumption plan, with functions that are triggered with the EventGridTrigger, and an Event Grid Topic that is triggering the functions. I'm relying on the retry functionality to increase reliability. If the function doesn't acknowledge the event (via a return from the function), I'm seeing that the event is re-delivered. So far so good.

I'm using EF Core with SQL Server, with a timestamp/row version column for concurrency management. More on that in a moment.

The documentation outlines that the event grid has 'at-least-once' delivery. This implies that the consumers of the events should be idempotent. We have a multi-step process, and we're making modifications for this process to be idempotent, so we're thinking about the various situations. This question involves one specific situation.

I'm reading 'at-least-once' to mean that a function could be triggered twice by the same event, and that an event could be processed by two instances of the function at the same time; maybe one is taking a long time, or the grid just delivered it twice.

If this is the case, and two function instances are processing an event at the same time, it's possible that one of them will succeed, and one could fail - either with a concurrency exception, or the function could have some other exception generated by the code or the platform. These platform exceptions are rare, but we HAVE seen indications of this (where the invocation just dies; we've seen this with single invocations, and have seen the event be re-delivered).

We've never seen any indications that an event is processed by two instances at the same time; my question assumes that this is possible, though. In that case, I'm not sure what will happen with the re-delivery:

In both of these cases, one instance acknowledges the event, the other does not. Does anyone know if the event will be redelivered in any of these cases? Does the order in which these instances run change the answer?

The actual problem is more complex: we're doing a multi-step process, and saving to the database at the end of each step. So the plan is to only allow each step to be processed once. In the case where two instances run at the same time, we're leveraging the concurrency validation in EF Core to prevent the second instance from saving the data when moving to the next 'step'. But we don't know whether to acknowledge the event - because the other instance could encounter a problem in a subsequent step. We're being 'optimistic' in that we think these problems will be rare, but we would like to get as much automatic recovery as possible.

Any insight would be helpful. We really can't change how that multi-step process is done; for one thing, the multiple steps allow us to track the progress.

Thanks in advance.

Upvotes: 3

Views: 1355

Answers (1)

Kristin
Kristin

Reputation: 1391

I couldn't find anything solid in the documentation. So I ran a few tests myself as I was intrigued by the question. I've also included a potential "solution" to your multistep issue, or rather, that it is not idempotent (yet?). Scroll through the wall of text, it's at the end.

First thoughts are that two instances could run the same event, if only and if the first invocation exceeds the default 30 seconds for response, which is why the "at-least-once" delivery is mentioned. Example:

  1. Event is published
  2. Instance A picks up the event
  3. Instance A runs for more than 30 seconds
  4. Event is added to retry queue as instance A hasn't replied yet
  5. Instance A is finished, 35 seconds total elapsed
  6. Event from retry queue is published after sitting in retry queue for 10 seconds
  7. Instance B picks up the event

I really think it comes down to the execution duration of the two instances how EventGrid reacts.

According to the documentation for Retry schedule:

If the endpoint responds within 3 minutes, Event Grid will attempt to remove the event from the retry queue on a best effort basis but duplicates may still be received.

This could be interpreted as such that as soon as an instance replies with a successful response EventGrid will attempt to clean out the retry queue (from the example above, instance B should never pick up the event). However, I'm also interested in what happens when instance A fails.

I tried with the following code, which will throw an exception from instance A after a certain delay:

public static class Function1
{
    private static Dictionary<string, int> _Counter = new();

    [FunctionName("Function1")]
    public static async Task Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
    {
        if (_Counter.ContainsKey(eventGridEvent.Id))
        {
            _Counter[eventGridEvent.Id]++;
            log.LogInformation($"{eventGridEvent.Id}: {_Counter[eventGridEvent.Id]}");
            return;
        }

        _Counter.Add(eventGridEvent.Id, 1);
        var delay = (long)eventGridEvent.Data;
        log.LogInformation($"Delay: {delay}");
        await Task.Delay(TimeSpan.FromSeconds(delay));
        log.LogInformation($"{eventGridEvent.Id}: {_Counter[eventGridEvent.Id]}");
        throw new Exception($"{eventGridEvent.Id} Throwing after {delay} seconds delay");
    }
}

Results from different delays

Delay = 35

2022-02-06T00:34:10.798 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:34:10.7899831+00:00', Id=9d1adfe9-21bd-454c-862e-a4d8d5b312e0)
2022-02-06T00:34:10.801 [Information] Delay: 35
2022-02-06T00:34:45.807 [Information] 6d3d141d-d0ee-4e81-a648-e2497e562b84: 1
2022-02-06T00:34:45.894 [Error] Executed 'Function1' (Failed, Id=9d1adfe9-21bd-454c-862e-a4d8d5b312e0, Duration=35096ms)6d3d141d-d0ee-4e81-a648-e2497e562b84 Throwing after 35 seconds delay
2022-02-06T00:34:55.949 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:34:55.9486015+00:00', Id=58a7a6db-2f97-4bc7-b878-030b8be25711)
2022-02-06T00:34:55.949 [Information] 6d3d141d-d0ee-4e81-a648-e2497e562b84: 2
2022-02-06T00:34:55.949 [Information] Executed 'Function1' (Succeeded, Id=58a7a6db-2f97-4bc7-b878-030b8be25711, Duration=1ms)

Seems that the retry that was most likely added to the queue after 30 seconds was either removed/replaced or its scheduled execution was "updated":

  • 58-13 = 45,
  • 35 for execution duration of instance A
  • 10 for retry delay after instance A failed

Delay = 45

2022-02-06T00:37:29.526 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:37:29.5262414+00:00', Id=c4596efa-7034-4efe-bcf9-85d502517711)
2022-02-06T00:37:29.526 [Information] Delay: 45
2022-02-06T00:38:14.524 [Information] 284fa140-1cdc-41e7-973f-d0ecd6f19692: 1
2022-02-06T00:38:14.530 [Error] Executed 'Function1' (Failed, Id=c4596efa-7034-4efe-bcf9-85d502517711, Duration=44999ms)284fa140-1cdc-41e7-973f-d0ecd6f19692 Throwing after 45 seconds delay
2022-02-06T00:38:24.590 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:38:24.5901035+00:00', Id=43a1fafa-7773-44e5-9f3a-b4bf697865a9)
2022-02-06T00:38:24.590 [Information] 284fa140-1cdc-41e7-973f-d0ecd6f19692: 2
2022-02-06T00:38:24.590 [Information] Executed 'Function1' (Succeeded, Id=43a1fafa-7773-44e5-9f3a-b4bf697865a9, Duration=0ms)

This is sorta unexpected behavior. We'd have expected to see a 2nd invocation finish before the 1st one, presumably finishing after 40 seconds after the 1st invocation's start. This could indicate that EventGrid actually knows that the 1st instance hasn't failed, and is still running, so it's actually not adding the retry attempt to the queue just yet. It does however add it after the 1st instance fails, and the retry attempt is, as expected, executed 10 seconds later.

I wanted to test the above, and thought the magic number could be 3 minutes (also mentioned in the documentation) and not 30 seconds.

Delay = 185

2022-02-06T00:40:20.381 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:40:20.3809687+00:00', Id=454b815f-6ca6-4b75-9272-711c7b6bf42a)
2022-02-06T00:40:20.381 [Information] Delay: 185
2022-02-06T00:43:00.393 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:43:00.3928958+00:00', Id=23910142-d903-4b9b-bec2-a5665e8e1db7)
2022-02-06T00:43:00.393 [Information] 97d828e6-d1f3-4523-b043-1f76465de6ea: 2
2022-02-06T00:43:00.393 [Information] Executed 'Function1' (Succeeded, Id=23910142-d903-4b9b-bec2-a5665e8e1db7, Duration=0ms)
2022-02-06T00:43:25.381 [Information] 97d828e6-d1f3-4523-b043-1f76465de6ea: 2
2022-02-06T00:43:25.387 [Error] Executed 'Function1' (Failed, Id=454b815f-6ca6-4b75-9272-711c7b6bf42a, Duration=185002ms)97d828e6-d1f3-4523-b043-1f76465de6ea Throwing after 185 seconds delay

This is also sorta contradicting the documentation. The first instance executes at 40:20. At 43:00 (would've expected 43:20), it runs the 2nd instance, completes immediately. Then the 1st instance fails at 43:25 (as expected). Maybe it is 3 minutes and not 30 seconds. But EventGrid only looks at the minute-part of the timestamp. One thing we can take away from this is that the 1st instance's delayed failure doesn't trigger a 3rd invocation.

Let's flip when the exception is thrown; the 2nd invocation will throw an exception, like this:

public static class Function1
{
    private static Dictionary<string, int> _Counter = new();

    [FunctionName("Function1")]
    public static async Task Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
    {
        if (_Counter.ContainsKey(eventGridEvent.Id))
        {
            _Counter[eventGridEvent.Id]++;
            throw new Exception($"{eventGridEvent.Id}: {_Counter[eventGridEvent.Id]}");
        }

        _Counter.Add(eventGridEvent.Id, 1);
        var delay = (long)eventGridEvent.Data;
        log.LogInformation($"Delay: {delay}");
        await Task.Delay(TimeSpan.FromSeconds(delay));
        log.LogInformation($"{eventGridEvent.Id} finished after {delay}: {_Counter[eventGridEvent.Id]}");
        return;
    }
}

Delay=35

2022-02-06T00:57:00.777 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:57:00.7684489+00:00', Id=16505ac4-0cb2-4b1a-8b7c-d4adf964fe14)
2022-02-06T00:57:00.780 [Information] Delay: 35
2022-02-06T00:57:35.794 [Information] 7fce3bc0-f5d3-4b3e-bd5d-4997299aa9b7 finished after 35: 1
2022-02-06T00:57:35.796 [Information] Executed 'Function1' (Succeeded, Id=16505ac4-0cb2-4b1a-8b7c-d4adf964fe14, Duration=35028ms)

Still nothing after 30 seconds. I'm skipping 45, as I'm pretty sure it'll yield same result as 35.

Delay=185

2022-02-06T00:59:04.226 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:59:04.2262900+00:00', Id=cac343e5-911b-4ba7-a0c7-69e883a61392)
2022-02-06T00:59:04.227 [Information] Delay: 185
2022-02-06T01:01:44.196 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T01:01:44.1962398+00:00', Id=2c65d66b-57c8-4a7d-b9ec-d54959e84e3d)
2022-02-06T01:01:44.271 [Error] Executed 'Function1' (Failed, Id=2c65d66b-57c8-4a7d-b9ec-d54959e84e3d, Duration=68ms)e051565c-9900-4cef-9786-5795c9fc7f91: 2
2022-02-06T01:02:09.240 [Information] e051565c-9900-4cef-9786-5795c9fc7f91 finished after 185: 2
2022-02-06T01:02:09.240 [Information] Executed 'Function1' (Succeeded, Id=cac343e5-911b-4ba7-a0c7-69e883a61392, Duration=185014ms)
2022-02-06T01:02:14.293 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T01:02:14.2930303+00:00', Id=729e8079-e33e-48f4-8c5a-48e102a6d9d1)
2022-02-06T01:02:14.300 [Error] Executed 'Function1' (Failed, Id=729e8079-e33e-48f4-8c5a-48e102a6d9d1, Duration=1ms)e051565c-9900-4cef-9786-5795c9fc7f91: 3
2022-02-06T01:03:14.395 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T01:03:14.3956880+00:00', Id=5996df91-cecf-418f-a5b5-82d7680761f8)
2022-02-06T01:03:14.403 [Error] Executed 'Function1' (Failed, Id=5996df91-cecf-418f-a5b5-82d7680761f8, Duration=1ms)e051565c-9900-4cef-9786-5795c9fc7f91: 4

It kept failing after this 'till the retry policy's delay became longer than the Function instance's idle cool down (the static dictionary was obviously removed from memory as the instance was taken down).

This could indicate that EventGrid ignores any response from an invocation older than 3 minutes. That is, the 1st invocation's 185 second delayed "OK" response was ignored. All subsequent invocations failed, forcing the event to be queued for retry.

There might be some more to it than what these simple tests show. If you really must now I suggest opening an issues ticket in GitHub

Potential solution for your idempotency issue

I'm not sure I fully understand your "multipart step". But if you can wrap that part into a single session, you could chain your EventGridTrigger with a session-enabled ServiceBusTrigger.

If your events have unique identifiers (eventGridEvent.Id), then you could build a service bus message with a session id matching the event's identifier. The service bus trigger, with sessions enabled, will make sure that the event (or session, rather) is only invoked 1 at a time.

This works because the two identical events also share the same eventGridEvent.Id. You will however have to keep track of these identifiers. At the end of your service bus session you'd have to mark the ID as "processed". At the beginning of your session, check if that ID was previously processed; ignore the request if it was.

Read more about ServiceBusTrigger with "Message sessions" here and here (look at the isSessionsEnabled property) (you can search for "session" to find more on the page related to session, it's scattered around).

Upvotes: 3

Related Questions