Reputation: 179
I have an Azure Event Hub Namespace based on a Standard tier with 17 TU's, which can also auto-inflate up to 40 TU's. It has 1 Event Hubs instance with 12 partitions.
This EH receives 2400 messages per second, which is 5.6 MB/second. 100% of this input is currently consumed by other clients that I don't know. I don't know how the partitioning is done, nor can I control it. I just know that the EH has 12 partitions, but I don't care, because I created another consumer group that streams a full copy of the incoming data to Azure Stream Analytics.
Concerning the message size, we have 13.8 TB / 6.08 billion ≈ 3 KB per message, a fairly complex nested JSON payload. My goal is to extract and write into an Azure Data Lake Storage Gen2 account only the records where "my_parameter" is present as a nested JSON field of the payload.
The records that match this rule are about 1 in 50, which means I should write about 50 messages per second, for a total of about 400 KB/second in Parquet format.
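For reference, the filter I have in mind is something like the sketch below (the input/output aliases and the nested "payload" record are placeholder names, since I can't share the real schema):

    -- Keep only the events whose payload contains "my_parameter".
    -- EventHubInput, DataLakeOutput and "payload" are placeholder names.
    SELECT *
    INTO DataLakeOutput
    FROM EventHubInput
    WHERE GetRecordPropertyValue(payload, 'my_parameter') IS NOT NULL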
As suggested by the documentation, we started by setting up 6 SUs. As you can see, ASA is unable to handle the incoming flow; this is clearly shown by:
- the watermark, which starts growing minute by minute and after 29 minutes has accumulated a 20-minute delay;
- the SU % utilization, which is constantly growing;
- the growing number of backlogged input events.
The number of incoming and outgoing events (the latter going into the Data Lake) is simply consistent with the numbers above.
To improve this situation, I followed this article to re-shuffle (repartition) the 12 EH partitions, which I can't control. I didn't change the 6 ASA SUs (= same cost), but I forced the repartition count to 2, 6 and 12 in separate tests with the technique from that article.
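Roughly, the repartitioning step looks like the sketch below (DeviceId stands in for whatever partition key is used, and 12 is one of the counts I tested, together with 2 and 6):

    -- Repartition the input before the pass-through; DeviceId is a placeholder key
    -- and 12 is one of the tested repartition counts (2, 6, 12).
    WITH RepartitionedInput AS
    (
        SELECT *
        FROM EventHubInput PARTITION BY DeviceId INTO 12
    )
    SELECT *
    INTO DataLakeOutput
    FROM RepartitionedInput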
It seems that the timing improves a bit (although I'm not sure a 1-hour test is enough); however, the main problem remains.
After many tests, I had to move up to 36 SUs to be able to support the workload and sometimes even catch up on the past messages still waiting in the Event Hub instance. This is the result using 36 SUs in a shared cluster, but I verified that I get the same result even using a dedicated cluster with 36 SUs.
In this case I started ASA after a 37-minute interruption; in fact, we see the watermark initially at 37 minutes and then slowly decreasing. This means that ASA is getting the oldest events first, and it's able to process them more quickly than the production speed.
Concerning the other parameters, we see that the CPU is close to 90% (which I shouldn't care about, right?), while the SU % utilization is quite small (about 10%) and stable, even though at the beginning the job is running faster than production because it has to recover the 37 minutes during which it wasn't working.
So, this looks like the solution to my problem. But I'm wondering why I should pay for 36 units (theoretically able to process 36 MB/second) when I need to process just 6.7 MB/second. It's 6 times more expensive this way!
My question: is this the right approach to identify the right sizing? How can I justify such a high price for something that should cost one sixth as much?
Upvotes: 1
Views: 577
Reputation: 842
We took the conversation offline as it was too specific, but here is some general guidance to approach that question.
You have a (mostly) pass-through query, from EH to Blob/Parquet, so you should be in embarrassingly parallel mode. You can check that by making sure you can scale that job up to 72 SUs (12 partitions * 6 SUs max per partition).
If not, you may have a misalignment between your input and output configurations, specifically the partition key in the input and the path pattern in the output. If that's the case, the job has to shuffle data and you will take a performance hit.
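As a rough sketch (reusing the placeholder aliases above), an aligned pass-through in embarrassingly parallel mode looks like this; on compatibility level 1.2 the PARTITION BY PartitionId clause is applied implicitly, while older levels need it spelled out:

    -- Pure pass-through aligned on the native Event Hub partitions: no shuffle,
    -- each of the 12 input partitions maps straight to one output writer.
    -- On compatibility level 1.2 the PARTITION BY clause below is implicit.
    SELECT *
    INTO DataLakeOutput
    FROM EventHubInput PARTITION BY PartitionId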
Your metrics make sense though. You have high CPU util but low SU (aka memory) usage. It looks like the job is indeed spending time unnesting the payload.
To identify what the costly component is here:
Unless we find a reason in the above, to be honest this is one of the use cases where ASA may not be the best tool for the job. We are using resources to deserialize/serialize records, apply dynamic typing, and optimize for heavy-duty processing/analytics… when you only need to write the records to disk. This can translate into wasted cycles and a cost/performance ratio that is not the best for that specific application.
Upvotes: 0