Reputation: 116

Correct approach to consume and prepare data from multiple sources for Power BI

I'm trying to establish if my planned way of working is correct.

I have two data sources: a MySQL database and an MSSQL database. I need to combine these data sources and expose the combined data for Power BI to consume.

I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.

The process is as follows:

MySQL and MSSQL are delta loaded into ASA in Parquet format and stored in Azure Data Lake Storage Gen2. Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA. Power BI consumes from this workspace / data source.

I'm not sure whether I should be landing the data from the sources in Azure Data Lake Storage Gen2 first, or whether I should just perform the transform and insert from the sources straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.

Upvotes: 1

Answers (2)

Himanshu Kumar Sinha

Reputation: 1786

Once copy pipeline is complete a subsequent data flow unions the data from the two sources and inserts into MSSQL storage in ASA

What is the MSSQL storage used for? If it is only used by Power BI to create reports, then you can use ADLS Gen2 instead, as it will be cheaper. This is basically in line with what Mark describes as the "Curated" zone.

One more thing to consider: Power BI can read data from both sources and then do the transformation itself.

Upvotes: 1

Mark Wojciechowicz

Reputation: 4477

The pattern that you're following is the data lake pattern, where data is moved between 3 zones:

Raw
Enriched
Curated

The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as Parquet files, here) is that you can troubleshoot a problem with the transformation, or create a different transformation to address a new need, without going back to the source systems.

The Enriched zone is where you have done some transformation, like UNIONing your data or applying other cleanup steps such as removing unneeded columns or correcting addresses. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
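To make the UNION step concrete, here is a minimal Python sketch of what the Enriched transformation does logically: each source is renamed to a shared set of columns, unneeded columns are dropped, and the two results are appended. The column names, the mappings, and the `source_system` tag are all hypothetical; in Synapse the same logic would live in the data flow or in a view rather than in Python.

```python
from typing import Dict, Iterable, List

# Hypothetical mappings: source column name -> shared column name.
MYSQL_COLUMNS = {"cust_id": "customer_id", "amt": "amount", "created": "order_date"}
MSSQL_COLUMNS = {"CustomerID": "customer_id", "Amount": "amount", "OrderDate": "order_date"}


def conform(rows: Iterable[Dict], mapping: Dict[str, str], source: str) -> List[Dict]:
    """Rename columns to the shared schema, drop unmapped columns, tag the source."""
    conformed = []
    for row in rows:
        out = {target: row[src] for src, target in mapping.items()}
        out["source_system"] = source  # keeps lineage visible after the union
        conformed.append(out)
    return conformed


def union_sources(mysql_rows: Iterable[Dict], mssql_rows: Iterable[Dict]) -> List[Dict]:
    """Equivalent of UNION ALL once both sources share the same columns."""
    return (conform(mysql_rows, MYSQL_COLUMNS, "mysql")
            + conform(mssql_rows, MSSQL_COLUMNS, "mssql"))
```

Keeping a source tag on every row is a design choice, not a requirement; it tends to make it easier to trace a bad row back to the Raw files it came from.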

The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
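As an illustration of the date dimension, here is a minimal Python sketch that generates one row per calendar day. The column names and the `YYYYMMDD` integer key are common conventions that I am assuming here, not something Power BI requires; you would normally load the result into a table in the Curated zone and relate it to the date column of your fact table.

```python
from datetime import date, timedelta
from typing import Dict, List


def build_date_dimension(start: date, end: date) -> List[Dict]:
    """Return one row per day from start to end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Integer surrogate key in YYYYMMDD form (an assumed convention).
            "date_key": current.year * 10000 + current.month * 100 + current.day,
            "date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_of_week": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "is_weekend": current.isoweekday() >= 6,
        })
        current += timedelta(days=1)
    return rows
```

For example, `build_date_dimension(date(2020, 1, 1), date(2030, 12, 31))` would cover a typical reporting range; extend the columns with fiscal periods or holidays if your reports need them.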

You do not need to use data lake technologies to follow this pattern and still get the benefits. Whether what you are doing is good depends on how everything performs versus how simple you can keep it. Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview

Upvotes: 3
