Enrique Benito Casado

Reputation: 2080

How to handle denormalization in B.I data model

As far as I know, normalization is done to avoid inconsistency in the database.

By normalizing we reduce redundancy and avoid inconsistencies such as update, insert, and delete anomalies.

That's why most OLTP databases are in 3NF.

Data from different OLTP databases comes together in a data warehouse (DWH, OLAP). DWHs are denormalized (closer to 1NF), and it seems obvious it has to be like that, because the main table of a DWH has hundreds of columns.

From that DWH we can build several data marts that we would later use for analysis with a BI reporting tool (Cognos, QlikView, BO, ...).

The problem is that the data model for the BI report is not normalized.

Couldn't that cause redundancy and data integrity problems in the report?

Upvotes: 1

Views: 1275

Answers (2)

AlexYes

Reputation: 4208

Regarding redundancy: some data warehouse engines, like Amazon Redshift, offer column compression that is very handy for denormalization. Let's say you have a table of sales events with 100M records, and every sale has a city. In an OLTP data model you would have sales and cities tables, with city_id connecting them. In an OLAP data model with compression enabled it's much easier to have a sales table with a text city attribute that is compressed. You'll be able to calculate sales by city without joining tables, and the city values won't occupy much disk space because they will be encoded.
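As an illustration, here is a minimal sketch of such a denormalized table (the table and column names are hypothetical, and the syntax assumes Amazon Redshift, where ENCODE sets the column compression):

-- hypothetical denormalized sales table: city is stored as plain text,
-- but compressed with a column encoding instead of being moved to a lookup table
create table sales (
    sale_id   bigint,
    sale_date date,
    amount    decimal(12,2),
    city      varchar(100) encode zstd   -- repeated city names compress well
);

-- sales by city, no join needed
select city, sum(amount) as total_sales
from sales
group by city;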

More info about compression is in the Amazon docs: Choosing a Column Compression Type

Regarding data integrity: you have to design your ETL routines to minimize the possibility of duplicate data and also run scheduled checks for duplicates based on criteria like this:

-- total row count
select count(*) from table;
-- distinct key count; should equal the total
select count(*) from (select distinct <expression> from table) t;

where <expression> is the list of columns whose combination should be unique (your artificial primary key).
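For instance, assuming a hypothetical sales table where the combination of sale_date, store_id and receipt_id is supposed to be unique, the two counts should match:

select count(*) from sales;

select count(*)
from (select distinct sale_date, store_id, receipt_id from sales) t;

If the second count is lower, the table contains duplicate rows for that key.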

Upvotes: 2

Dimgold

Reputation: 2944

In OLAP systems (such as data warehouses), the key efficiency needs are in querying and data retrieval.

Therefore, some design decisions are made in order to retrieve information faster, even if updates become slower.

An example of such a model is a star schema, in which we denormalize data in such a way that all the data is stored within a single join hop of the facts.

Key elements such as transactions are located in the big table (the fact table), with foreign keys pointing at the dimensions.

The dimensions themselves are smaller and may contain non-normalized data. For example, an address dimension may store street, neighborhood, and city data without normalizing it to 3NF.

Star Schema example

There are certainly redundancy issues (you don't really have to store Day_of_Week for each date row), but they are insignificant, since storage is not the bottleneck in this scenario.
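To make this concrete, here is a minimal star-schema sketch (table and column names are hypothetical, not taken from the diagram above): the fact table holds foreign keys to the dimensions, the store dimension keeps the address fields without further normalization, and the date dimension carries the redundant Day_of_Week column mentioned above.

create table Dim_Store (
    store_key      int primary key,
    street         varchar(200),
    neighborhood   varchar(100),
    city           varchar(100),
    state_province varchar(100),
    country        varchar(100)    -- not normalized: country repeats per store
);

create table Dim_Date (
    date_key    int primary key,
    full_date   date,
    day_of_week varchar(10)        -- redundant: derivable from full_date
);

create table Fact_Sales (
    date_key  int references Dim_Date (date_key),
    store_key int references Dim_Store (store_key),
    amount    decimal(12,2)
);

-- every attribute is one join hop away from the facts
select d.day_of_week, s.city, sum(f.amount)
from Fact_Sales f
join Dim_Date d on d.date_key = f.date_key
join Dim_Store s on s.store_key = f.store_key
group by d.day_of_week, s.city;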

As for integrity: you face it only on updates (e.g., the less-realistic scenario of a country change for a State_Province in Dim_Store), and in a DWH an update is a rare case in which we allow ourselves to be inefficient.

Moreover, integrity is not enforced by the DB (or by normalization) but by the design and implementation of the ETL process.
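For example, one common way to let the ETL (rather than the DB) guard uniqueness is to load a dimension only with keys that are not already there; this is a hypothetical sketch reusing the Dim_Store table from the sketch above and an assumed staging_store table:

-- insert only stores whose key is not yet present in the dimension
insert into Dim_Store (store_key, street, neighborhood, city, state_province, country)
select s.store_key, s.street, s.neighborhood, s.city, s.state_province, s.country
from staging_store s
left join Dim_Store d on d.store_key = s.store_key
where d.store_key is null;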

Read more on data warehouse modeling

Upvotes: 3
