Adrian Grassl

Reputation: 83

Event Hubs data to a SQL Data Warehouse (Synapse)

We are trying to integrate Event Hub (EH) data (~200 MB and 50k messages per minute) into a SQL Data Warehouse (DW) staging area.

So far we have tried to solve this by reading the EH data directly with an Azure Function (AF) and writing it to Synapse, but we are hitting the DW's maximum number of concurrent open sessions (512 for anything below DWU500c). We've also tried increasing the maxBatchSize read from the EH in order to reduce the number of sessions needed on the DW side, but this seems to make the AF quite unstable.
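
For context, this is roughly how we adjusted maxBatchSize for the Event Hubs trigger in host.json (Functions v2/v3 runtime; the numbers below are illustrative, not the exact values we tested):

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 1,
      "eventProcessorOptions": {
        "maxBatchSize": 256,
        "prefetchCount": 512
      }
    }
  }
}
```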

Today I found this tutorial in the Microsoft documentation, which decouples the EH from the AF by using the EH Capture feature and Event Grid (EG) to trigger the AF whenever a blob file has been written. Am I right in assuming that this should drastically reduce the maximum concurrent open sessions in the DW, since the AF would be reading much larger batches from the captured blob files, which can be up to 500 MB in size?

What are the advantages of one solution over the other? Do you have any other best practices for achieving this?

Thx in advance!

Upvotes: 1

Views: 1331

Answers (1)

Kashyap

Reputation: 17411

Cost

In general I don't like it when they ask me to push data from one queue to another queue and then to some storage before the final destination. It almost seems like you pay extra for no reason.

In the past they've tried to hawk Data Factory to us as well, for loading data from the data lake into Synapse. Azure Functions are ridiculously cheap compared to Data Factory.

Don't take the Azure-prescribed "patterns" at face value. They're great as a source of ideas on how to integrate, but I would check the cost and try to optimize.


Cheapest I can think of:

If you have control over the source of your data, then consider writing it to CSV/Parquet/ORC files in the data lake instead.

Then you can use a timer-triggered Azure Function to load them into Synapse periodically using COPY INTO, as sketched below.
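
A minimal sketch of what that periodic load could look like (the table name, storage URL and credential choice are hypothetical; adjust them to your layout):

```sql
-- Load staged CSV files from the data lake into a Synapse staging table.
-- dbo.StagingEvents and the storage path are placeholders.
COPY INTO dbo.StagingEvents
FROM 'https://mydatalake.blob.core.windows.net/landing/events/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    FIRSTROW = 2  -- skip the header row
);
```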


Future:

As far as I understand the direction of the architecture (I don't work for Microsoft and have no official knowledge), they're trying to push the COPY command as the future of PolyBase loading (which is the preferred way to load data into Synapse/DW).

In general you would want to dig/ask around to find out which sub-features of something in preview are already GA. E.g. Synapse itself is in preview and expected to be GA by end of 2020, but Synapse Pool is GA; COPY INTO is in preview, but support for loading CSV files using COPY INTO is GA, and so on...

They've also launched Azure Stream Analytics. Again, it's in preview, but it seems like the future for streaming data "into Synapse" or "from Event Hub", among the various sources and destinations it supports. Here are some solution patterns.

HTH

Upvotes: 2
