Reputation: 116
I'm trying to establish if my planned way of working is correct.
I have two data sources; a MySql & MSSQL database. I need to combine these data sources and expose this data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as followed:
MySql & MSSQL delta loaded into ASA as parquet format, stored in Azure Gen 2 Storage. Once copy pipeline is complete a subsiquent data flow unions the data from the two sources and inserts into MSSQL storage in ASA. BI Consumes from this workspace / data soruce.
I'm not sure if I should be storing from the data sources to Azure Gene 2, or I should just perform the transform and insert from the source straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
Upvotes: 1
Views: 124
Reputation: 1786
Once copy pipeline is complete a subsiquent data flow unions the data from the two sources and inserts into MSSQL storage in ASA
What is the use MSSQL storage ? Is it only used by PowerBI to create reports , if yes then you can use ADLS gen2 , as it will be cheaper, ( basically very in line with Mark said above as "curated"
Just one more thing to consider , PowerBI can read data from both the sources and then do the transformation within itself.
Upvotes: 1
Reputation: 4477
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
The Raw
zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched
zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
The Curated
zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
You do not need to use data lake technologies to follow this pattern and still get the benefits. As far as whether what you are doing is good will be based on how everything performs versus how simple you can keep it. Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
Upvotes: 3