Andy T

Reputation: 9881

Slow Stream Analytics with blob input

I've inherited a solution that uses Stream Analytics with blobs as the input and then writes to an Azure SQL database.

Initially, the solution worked fine, but after adding several million blobs to a container (and not deleting old blobs), Stream Analytics is slow in processing new blobs. Also, it appears that some blobs are being missed/skipped.

Question: How does Stream Analytics know there are new blobs in a container?

Prior to Event Grid, Blob storage did not have a push-notification mechanism to tell Stream Analytics that a new blob needs to be processed. So I'm assuming that Stream Analytics polls the container for the list of blobs (with something like CloudBlobContainer.ListBlobs()) and saves that list internally; on the next poll it can compare the new list with the saved one to determine which blobs are new and need to be processed.
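To make the hypothesis concrete, here is a minimal sketch of that poll-and-diff behavior. This is an assumption about how it might work, not documented ASA internals; `list_blobs` below is a hypothetical stand-in for a call like `CloudBlobContainer.ListBlobs()`:

```python
# Hypothetical sketch of a poll-and-diff loop (NOT documented ASA internals).
# Each poll enumerates every blob, so the cost of a poll grows with the
# total number of blobs in the container -- even blobs processed long ago.

def find_new_blobs(list_blobs, seen):
    """Return blobs not observed in any previous poll, updating `seen`."""
    current = set(list_blobs())
    new = current - seen
    seen |= new
    return sorted(new)

seen = set()
blobs = ["2023/01/01/00/a.json"]
print(find_new_blobs(lambda: blobs, seen))  # first poll: everything is new
blobs.append("2023/01/01/01/b.json")
print(find_new_blobs(lambda: blobs, seen))  # second poll: only the new blob
```

If the mechanism is anything like this, it would explain the observed slowdown: old blobs that are never deleted still have to be enumerated and diffed on every poll.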

The documentation states:

Stream Analytics will view each file only once

However, besides that note, I have not seen any other documentation to explain how Stream Analytics knows which blobs to process.

Upvotes: 0

Views: 338

Answers (1)

Vignesh Chandramohan

Reputation: 1306

ASA uses List Blobs to get the list of blobs in the container.

If you can partition the blob path by a date/time pattern, it would be better: ASA then only has to list a specific path to discover new blobs. Without a date pattern, all blobs in the container have to be listed on every poll, which is probably why it gets slower with a huge number of blobs.
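As a rough illustration of why the date/time pattern helps: with a path pattern like `input/{date}/{time}/` (Date Format `YYYY/MM/DD`, Time Format `HH` in the ASA blob input settings), only the prefix for the current hour needs to be listed. The sketch below just shows how such a prefix would be derived; the pattern and container layout are assumptions for illustration:

```python
from datetime import datetime, timezone

# Illustrative sketch: expand a {date}/{time} path pattern into the
# narrow prefix that would need listing for a given hour, instead of
# enumerating the entire container. Pattern and formats are assumed
# examples, not ASA source behavior.

def prefix_for(now, pattern="input/{date}/{time}/"):
    return (pattern
            .replace("{date}", now.strftime("%Y/%m/%d"))   # YYYY/MM/DD
            .replace("{time}", now.strftime("%H")))        # HH

now = datetime(2023, 1, 1, 13, tzinfo=timezone.utc)
print(prefix_for(now))  # input/2023/01/01/13/
```

Listing under a one-hour prefix stays cheap no matter how many old blobs accumulate elsewhere in the container, whereas a flat layout forces a full enumeration every poll.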

Upvotes: 1
