Frank

Reputation: 99

Best practice for layer concept with ADF and Databricks

I want to build a data-warehouse-like layer concept using Azure Data Factory and Databricks, for example an Ingestion Layer, a Propagation Layer, and a Data Mart Layer. The idea was to create separate Databricks scripts for each layer's transformations and then orchestrate all of this in ADF pipelines. However, the challenge is how to orchestrate the data loads from/to Databricks for each step, especially handling Databricks' in-memory data models and the handover to persistent storage for each layer (e.g. to an Azure SQL DB). If I need to load everything into Databricks again for each layer's processing, this will result in a lot of I/O overhead and slow processing times. However, if I instead keep everything in Databricks until the final layer, it will be hard to keep track of pipeline errors from within ADF and to reprocess specific layers.

I am looking for best practices on how to handle a layer concept with ADF and Databricks, key design principles, or similar. Thanks in advance!

Upvotes: 0

Views: 2202

Answers (1)

Nadine Raiss

Reputation: 641

If you are going to build a lakehouse architecture (Delta Lake architecture), you should have an Azure Data Lake Storage Gen2 resource to store all of your data (ideally in Parquet format). The first zone holds the raw data as ingested (Bronze zone). The second holds a more refined/filtered view of the data (Silver zone). Lastly, the third provides business-level aggregates that are used for reporting and dashboarding (Gold zone).
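As an illustration, each zone can be a separate database in the Databricks metastore, backed by its own path in the Data Lake Storage Gen2 account. A minimal sketch (the storage account, container, and database names here are only example assumptions, not a fixed convention):

    # Minimal sketch: one database per zone, each backed by its own
    # ADLS Gen2 location (account/container/database names are examples only).
    # `spark` is the SparkSession that Databricks notebooks provide.
    for zone in ("bronze", "silver", "gold"):
        spark.sql(f"""
            CREATE DATABASE IF NOT EXISTS {zone}
            LOCATION 'abfss://lake@mystorageaccount.dfs.core.windows.net/{zone}'
        """)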

For your process, you should first use Azure Data Factory to connect to your data sources and load the raw data into your Data Lake Storage container (Copy activity in your ADF pipeline). Then, you refine/transform your data into Bronze, Silver, and Gold tables with Azure Databricks and Delta Lake.
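Here is a minimal PySpark sketch of what those notebook steps might look like (the paths and column names are assumptions for illustration, not part of your setup). Because every layer is persisted as a Delta table, each step can run as its own Notebook activity in the ADF pipeline and be re-run independently, which addresses the error-tracking and reprocessing concern from the question:

    # Minimal sketch (example paths and columns): each layer reads from the
    # previous layer's Delta table and writes its own, so nothing relies on
    # an in-memory handover between ADF pipeline steps.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # already provided in Databricks

    base = "abfss://lake@mystorageaccount.dfs.core.windows.net"

    # Bronze: persist the raw files that the ADF Copy activity landed, as-is.
    raw_df = spark.read.json(f"{base}/landing/sales/")
    raw_df.write.format("delta").mode("append").save(f"{base}/bronze/sales")

    # Silver: cleaned / conformed view of the Bronze data.
    bronze_df = spark.read.format("delta").load(f"{base}/bronze/sales")
    silver_df = (bronze_df
                 .dropDuplicates(["order_id"])
                 .withColumn("order_date", F.to_date("order_ts")))
    silver_df.write.format("delta").mode("overwrite").save(f"{base}/silver/sales")

    # Gold: business-level aggregates for reporting / dashboards.
    gold_df = (spark.read.format("delta").load(f"{base}/silver/sales")
               .groupBy("order_date")
               .agg(F.sum("amount").alias("daily_revenue")))
    gold_df.write.format("delta").mode("overwrite").save(f"{base}/gold/sales_daily")

If the Gold layer should also land in an Azure SQL DB, as mentioned in the question, a further ADF activity (or the Databricks job itself, writing via JDBC) can move the Gold output there as a final step.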


Upvotes: 3
