Reputation: 2101
I have been tasked to design a Kimball-style data warehouse. It will sit on-prem in SQL Server. What is the best practice for the organization of the physical implementation? That is, should the data warehouse be a single database, using schemas to separate each of the data marts (and also putting all dimensions in their own schema, to help "drive" re-use across marts)? Or, should each data mart be its own database (forcing all dimensions to live in a separate database)?
Does the decision matter if I was to use a cloud platform for the data warehouse, say Azure SQL DB (e.g., use managed instance to allow for cross-database querying)?
Upvotes: 1
Views: 193
Reputation: 89211
Both multi-schema and multi-database designs are common. The main benefits of a multi-database design are:
The ability to use different backup/recovery models and HA options for different parts of the DW.
An easier security, administration, and DevOps story when different teams own different parts of the DW.
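For example, in a database-per-mart design on SQL Server, conformed dimensions can live in a shared database and be joined from a mart with three-part names. This is just a sketch; the database, schema, and table names (`EDW_Dims`, `EDW_Sales`, etc.) are illustrative, not a standard:

```sql
-- Hypothetical layout: a shared EDW_Dims database plus one database per mart.
-- Three-part names let a mart's fact table join to the shared dimensions.
SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS TotalSales
FROM   EDW_Sales.dbo.FactSales AS f
JOIN   EDW_Dims.dbo.DimDate    AS d
       ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear;
```

This style of query is what you lose on platforms without first-class cross-database querying.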
In Azure SQL Database (not Managed Instance) and Synapse SQL Pools (currently), first-class cross-database queries are not available, so you're more likely to use a multi-schema design there. Managed Instance has some cost-for-performance issues with data warehouse workloads, and Hyperscale is typically a better fit for data warehouses in the 100 GB-10 TB range.
But all Managed Instance databases use the Full recovery model with platform-managed backups, and in MI you can't share storage resources between databases, so a multi-database design may be more expensive.
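A single-database, multi-schema layout (the likelier choice on Azure SQL Database) could be sketched like this; the schema and table names are illustrative, not prescribed:

```sql
-- One database; a schema per mart, plus a shared schema for conformed dimensions.
CREATE SCHEMA dim;    -- conformed dimensions, reused across marts
GO
CREATE SCHEMA sales;  -- Sales mart
GO
CREATE SCHEMA fin;    -- Finance mart
GO

CREATE TABLE dim.DimDate
(
    DateKey      int      NOT NULL PRIMARY KEY,
    CalendarDate date     NOT NULL,
    CalendarYear smallint NOT NULL
);

-- Each mart's facts reference the shared dimensions with ordinary two-part names.
CREATE TABLE sales.FactSales
(
    DateKey     int           NOT NULL REFERENCES dim.DimDate (DateKey),
    ProductKey  int           NOT NULL,
    SalesAmount decimal(19,4) NOT NULL
);
```

Schemas also give you a natural security boundary: you can GRANT per-schema permissions to each mart's team.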
So there is no clear best practice there. A clustered columnstore index on large fact tables is a strong recommendation, combined with the general columnstore best practices for loading and maintenance.
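A minimal sketch of that recommendation (table and index names are illustrative): create the fact table with a clustered columnstore index and load it in large batches so rows compress directly into rowgroups.

```sql
-- Clustered columnstore on a large fact table (illustrative names).
CREATE TABLE sales.FactSales
(
    DateKey     int           NOT NULL,
    ProductKey  int           NOT NULL,
    SalesAmount decimal(19,4) NOT NULL,
    INDEX ccix_FactSales CLUSTERED COLUMNSTORE
);

-- Load in batches of at least 102,400 rows so they bypass the delta store
-- and compress straight into columnstore rowgroups. As maintenance,
-- periodically reorganize to compress any remaining open rowgroups:
ALTER INDEX ccix_FactSales ON sales.FactSales
    REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
```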
Upvotes: 2