TechCowboy
TechCowboy

Reputation: 51

Dealing with two fact granularities - Dimensional Model

I have a question about creating a dimensional model and dealing with different levels of granularity.

I would like to know which out of these two approaches would be best and why. Or if there is another approach which would be even better.

The scenario I'm usinging is simple: I have 2 dimensions, Region and Customer and 1 fact, Sales.

This becomes two dimension tables, one for region and another for customer with a fact table containing sales looking like so:

enter image description here

Now I want to aggregate the sales by region. But I'm not sure which is the best approach.

Should I aggregate the sales by region and then join the data onto the fact table so the the model looks like this:

enter image description here

Or should I create a new table which holds the aggregated values with a key joining back to the fact and region dimension table which would look like this:

enter image description here

Or is there another approach which beats these two?

Your wisdom and input is appreciated.

Thanks

Upvotes: 1

Views: 2200

Answers (1)

Rich
Rich

Reputation: 2279

Your first diagram allows you to query the fact to aggregate by region, but I assume you're looking to pre-aggregate region-level results for reasons of performance.

The standard technique for aggregated facts is to create a separate fact table at the level you need, that supplements the main fact. Ideally you'd have a query tool that knew when you would benefit from using the aggregated fact.

The aggregated fact would have only RegionKey and Sales in (i.e. a foreign key to the region dimension). This is similar to your second solution, but there's no link to the fact from which the figures have been aggregated. There's no need: you can already see which detailed facts make up the aggregates from the main fact itself.

Your first solution 'mixes the grain' of the fact and isn't recommended. Fact tables should have a clearly stated grain so that you know what each row represents, e.g. a measurement of a sale to a customer. If you included the aggregated figure, this does not apply to one customer's sale (or just one customerkey), and you could get double counting if you didn't understand that when querying. Measures in a fact should ideally be 'additive' across all dimensions: you couldn't sum up the RegionSales aggregate figure against anything other than region.

However, features of modern BI tools and database systems have performance features that much reduce the need for aggregate facts. Columnar databases (or columnstore indexes on relational databases such as SQL Server), in memory relational data models such as those found in Power BI, all contribute to this kind of query running fast without special aggregate tables. This is important, because it can be a pain to keep your aggregate tables up to date and in sync with your underlying facts.

Upvotes: 5

Related Questions