Data Architecture - Full Azure Stack vs Integrated Delta Lake

Question

A friend's company is working on a Data architecture which to us, it seems to be rather convoluted and having several scalability and cost problems.

If possible, I would like to have your opinion on the old and proposed architectures (or alternatives), to discuss their pros and cons and potentially finding unforeseen problems/limitations.

Current Architecture - Azure Stack

Ingestion Layer

Multiple sources stored into Azure Data Lake Gen2 via Azure Databricks

Processing Layer

Azure Databricks cleans the data and stores them back into Azure Data Lake Gen2 into different sections: raw, clean

Loading Layer

Azure Databricks is used to load the data into a Azure SQL Server instance
Azure Synapse is used as the layer between Azure SQL Server and Azure Analysis Services

Presentation Layer

Data Models created in Azure Analysis Services and served into Power BI or Excel

Pros of this Approach

Models and optimisation (OLAP) generated via Azure Analysis Services
Fully integrated within the Azure ecosystem

Cons of this Approach

Vertically scalable: as the data grows, so will the Azure SQL Server used to store the Data Warehouse grow vertically, as will the Azure Analysis Services required to process said data
High Costs
Convoluted: a lot of services serving as glue, harder to manage/maintain
Always-on approach: both Azure SQL Server and Azure Analysis Services are required to be always on, representing unneeded expenses

Proposed Architecture - Azure w/ Delta Lake

The alternative architecture relies on the fact that Azure Databricks is already used within the ETL process and attempts to maximize its usage to provide horizontal scalability and Serverless resources.

Ingestion Layer

(same) Multiple sources stored into Azure Data Lake Gen2 via Azure Databricks

Processing and Loading Layers

(same) Azure Databricks cleans the data and stores them back into Azure Data Lake Gen2 into different sections: raw, clean
Azure Databricks generates the Data Warehouse using Delta Lake storing it directly in Azure Data Lake Gen2 creating Silver and Gold (Aggregation/Cubes) quality

Presentation Layer

Using Azure Synapse Analytics (Serverless) to designate access and querying capabilities directly on Azure Data Lake Gen2 which is then exposed to Power BI and Excel for Governance purposes

Pros of this Approach

Simpler
Horizontally scalable: the pipeline relies on Azure Delta Lake Gen2 (Parquet), which is naturally horizontally scalable; and Azure Synapse Analytics (Serverless) which, due to its Serverless pool, can also be considered horizontally scalable-ish.
Delta Lake provides a natural Audit layer and is easily integrated with existing Data Catalog solutions

Cons of this Approach

No Data Models to facilitate usage via Power BI and Excel
Critical to partition the Parquet files according to Azure Synapse Analytics's queries else we may incur in high transfer costs
No true OLAP capabilities

Data Architecture - Full Azure Stack vs Integrated Delta Lake

Answers (1)

Related Questions