Joe Dyndale

Reputation: 1071

How to fix being limited to 1 client reading from Azure ServiceBus

I have something that's been quite the head scratcher for me...

I have a small service that reads messages from an Azure ServiceBus queue and stores the data in a CosmosDB collection.

The problem is that I can't get my service to scale. I have been able to optimize things to improve the number of messages read per second for one instance of the service. However, adding more instances of the service slightly degrades the number of messages read per second in total!

It's important to note that sending messages to the queue in batches works like a charm; I can send 1000-2000 messages per second to the queue without any issues. Reading from the queue is the problem.

My handler is slightly CPU intensive, and the messages range from approximately 2 KB to 900 KB in size, the average being somewhere around 25 KB. I've gotten one instance to handle approximately 41.5 messages per second now.

If I add a second instance of the service (which is an Azure Web App, by the way), the total number of messages read per second across all instances drops to approximately 40. Adding yet another instance brings it down closer to 38.

The actual code that reads messages from the queue (and handles retries, deadlettering etc.) is part of an internal company framework, which a lot of other services use, none of which have this issue. Other services have the expected behavior that performance scales linearly with the number of service instances (up to the max that ServiceBus can handle, obviously).

I have the same problem on two different Azure subscriptions (TEST and PROD) which both use the Premium ServiceBus tier.

I am not using sessions on the queue.

Has anyone here ever had a similar issue, and how did you solve it?

Things I've tried:

The only resources shared between instances of my web app are ServiceBus and CosmosDB, and as noted above, I've ruled out CosmosDB. Since I'm having the same issue in both our TEST and PROD subscriptions (our DEV subscription doesn't allow scaling out), and I've recreated the queue a few times in various ways, it can't be the queue itself either. None of the other queues in use on the same ServiceBus instance have this issue.

Tweaking and optimizing the code has, as expected, only had an impact on the performance of a single instance. As far as I can tell, the possible external bottlenecks have been ruled out. The one remaining suspect, our internal framework that handles the actual reading of messages from the queue, has also been ruled out, since the exact same version of the framework is used in many other web apps where scaling out has been demonstrated to work.

I feel pretty checkmated here...

SOLUTION: I forgot to update this question, so at last here it is... We eventually managed to set aside time to focus entirely on this problem, and through various tests we concluded that the cause was a combination of using the ReadBatchAsync method in the SDK and having rather large messages. Switching to OnMessageAsync fixed it.
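For reference, here is a minimal sketch of the message-pump style that fixed it, assuming the older WindowsAzure.ServiceBus SDK (Microsoft.ServiceBus.Messaging), which is where OnMessageAsync lives; the connection string, queue name, concurrency value and ProcessMessageAsync handler are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class QueueWorker
{
    private readonly QueueClient _queueClient =
        QueueClient.CreateFromConnectionString("<connection string>", "<queue name>");

    public void StartPump()
    {
        var options = new OnMessageOptions
        {
            MaxConcurrentCalls = 16, // tune to the instance's CPU budget
            AutoComplete = false     // complete manually after a successful handle
        };
        options.ExceptionReceived += (s, e) => Console.Error.WriteLine(e.Exception);

        // The SDK runs the receive loop and pumps messages into the callback,
        // instead of the caller polling for batches.
        _queueClient.OnMessageAsync(async message =>
        {
            await ProcessMessageAsync(message); // hypothetical handler
            await message.CompleteAsync();
        }, options);
    }

    private Task ProcessMessageAsync(BrokeredMessage message)
    {
        // ... handler logic (CPU work, CosmosDB write) ...
        return Task.CompletedTask;
    }
}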

Upvotes: 3

Views: 824

Answers (2)

Ilya Chernomordik

Reputation: 30335

I would suggest first eliminating the possibility that the handling code is the problem. Try running with a dummy StartProcessMessage that does nothing, to ensure the handler isn't the bottleneck (e.g. too many writers writing to some shared resource, or something similar).
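For example, a no-op stand-in along these lines (a hypothetical sketch; the exact signature depends on your framework) would show whether receive throughput scales once the handler does no work:

// Hypothetical no-op stand-in for the real handler, used only to measure receive throughput.
void StartProcessMessage(Message m)
{
    // Intentionally empty: no CPU work, no CosmosDB write, no shared resources.
}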

Another option you can try is the newer .NET library, Microsoft.Azure.ServiceBus. The classes available there let you run a built-in receive loop with MaxConcurrentCalls in a more natural and easier way. But ensuring it's not the handler is the first thing you should try; if you already have, maybe you should share it.
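For illustration, a rough sketch of that built-in loop with Microsoft.Azure.ServiceBus; the handler, concurrency value, connection string and queue name are placeholders:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;

class Receiver
{
    private readonly QueueClient _queueClient =
        new QueueClient("<connection string>", "<queue name>");

    public void Start()
    {
        var options = new MessageHandlerOptions(OnExceptionAsync)
        {
            MaxConcurrentCalls = 16, // concurrency per instance
            AutoComplete = true
        };

        // The client owns the receive loop and invokes the handler concurrently.
        _queueClient.RegisterMessageHandler(HandleAsync, options);
    }

    private Task HandleAsync(Message message, CancellationToken token)
    {
        // ... handler logic goes here ...
        Console.WriteLine($"Received message of {message.Body.Length} bytes");
        return Task.CompletedTask;
    }

    private Task OnExceptionAsync(ExceptionReceivedEventArgs args)
    {
        Console.Error.WriteLine(args.Exception);
        return Task.CompletedTask;
    }
}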

Upvotes: 1

Nkosi

Reputation: 247451

It is usually not a good idea to have async void operations.

Additionally, you could refactor the processing so that it is invoked in batches as well.

The first approach assumes StartProcessMessage cannot be made async:

void StartProcessMessage(Message m) {
    //... handler logic; assumed to decrement _messagesInProgress (e.g. Interlocked.Decrement) when it finishes.
}

public async Task Start() {
    while (true) {
        // Receive only as many messages as the remaining concurrency budget allows.
        var messages = (await _queueClient.ReceiveBatchAsync(
            Math.Max(1, _configuration.MaxConcurrentCalls - _messagesInProgress))).ToArray();
        Interlocked.Add(ref _messagesInProgress, messages.Length);
        var tasks = messages.Select(m => Task.Run(() => StartProcessMessage(m)));
        await Task.WhenAll(tasks); // process the batch in parallel.
        // Throttle until in-flight work drops back under the configured limit.
        while (_messagesInProgress > _configuration.MaxConcurrentCalls) {
            await Task.Delay(100);
        }
    }
}

The second approach assumes that StartProcessMessage can be refactored to be async:

Task StartProcessMessage(Message m) {
    //... async handler logic; assumed to decrement _messagesInProgress when it completes.
}

public async Task Start() {
    while (true) {
        // Receive only as many messages as the remaining concurrency budget allows.
        var messages = (await _queueClient.ReceiveBatchAsync(
            Math.Max(1, _configuration.MaxConcurrentCalls - _messagesInProgress))).ToArray();
        Interlocked.Add(ref _messagesInProgress, messages.Length);
        var tasks = messages.Select(m => StartProcessMessage(m));
        await Task.WhenAll(tasks); // process the batch in parallel.
        // Throttle until in-flight work drops back under the configured limit.
        while (_messagesInProgress > _configuration.MaxConcurrentCalls) {
            await Task.Delay(100);
        }
    }
}

Upvotes: 1
