cget

Reputation: 410

A practical example of denormalization in a SQL database?

I've been reading about denormalization for the last 20 minutes but can't find a concise example with code.

Is this what denormalization is?


1. We have a normalized database:

Table_1:
customer_id (Primary key)
country
city
street
house_number

Table_2:
product_id (Primary Key)
customer_id (Foreign key)
product_storage_building

Table_3:
product_id (Foreign Key)
product_name
product_color
product_origin

  2. However, joining all three tables takes far too long to run. Let's say:

        SELECT a.*, b.*, c.*
        FROM 
        TABLE_1 AS a
        LEFT JOIN TABLE_2 AS b
        ON a.customer_id = b.customer_id
        LEFT JOIN TABLE_3 AS c
        ON b.product_id = c.product_id
    

So I create a new table out of Table_1 and Table_2

    CREATE OR REPLACE TABLE Denormalized_Data AS
    (
     SELECT Table_1.customer_id, 
            country, 
            city,
            street, 
            house_number,
            product_id,
            product_storage_building
     FROM Table_1
          LEFT JOIN Table_2
          ON Table_1.customer_id = Table_2.customer_id
    )
  3. Then join to Table_3 as follows:

     SELECT customer_id, 
            country, 
            city,
            street, 
            house_number,
            product_storage_building,
            Denormalized_Data.product_id,
            product_name,
            product_color
         FROM Denormalized_Data
         LEFT JOIN Table_3
         ON Denormalized_Data.product_id = Table_3.product_id
    

Now this will make my query run faster - can the whole process above be described as denormalization?

Thanks

Upvotes: 6

Views: 27846

Answers (4)

rtenha

Reputation: 3618

Consider the image below. The top contains several distinct tables that encapsulate logically separate bits of info. The bottom shows the results of those tables joined together. This is denormalization.

Normalized to Denormalized

In the case of BigQuery, and especially when using BQ as a backend for a BI platform, denormalized data provides a quicker user experience because the engine doesn't have to do the joins when a user hits 'run'.

If you leave the tables as-is and a user needs several of the fields, you might end up doing up to 7 joins and then aggregations (sums, counts, etc.). However, if you do all 7 joins once and store the result in 1 table, the user queries only 1 table and only does aggregations. This is the power of BigQuery: it is scalable, so grouping and aggregating over huge columns of data is relatively 'easy' compared to joins, making the end-user experience much faster.

People/companies that go in this direction typically do so in ETL processes (commonly overnight), so the joins only have to happen once, when users typically aren't using the database. Then, during the day, users and BI tools just aggregate and slice the data without any joins. This does result in 'redundant' data and incurs extra storage cost, but it is often worth it for the downstream user experience.
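
As a rough sketch of that overnight ETL pattern (dataset and column names here are hypothetical, BigQuery-style SQL):

```sql
-- Nightly ETL: do the expensive joins once and materialize the result.
CREATE OR REPLACE TABLE reporting.denormalized_sales AS
SELECT c.customer_id, c.country, c.city,
       p.product_id, p.product_name,
       s.product_storage_building
FROM raw.customers AS c
LEFT JOIN raw.storage  AS s ON s.customer_id = c.customer_id
LEFT JOIN raw.products AS p ON p.product_id = s.product_id;

-- Daytime BI query: no joins, just grouping and aggregating one table.
SELECT country, COUNT(DISTINCT product_id) AS products
FROM reporting.denormalized_sales
GROUP BY country;
```

The first statement runs once per night; every daytime query hits only the single pre-joined table.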

Upvotes: 9

Mikhail Berlyant

Reputation: 173046

Below is a BigQuery-specific answer!

BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, you can improve performance by denormalizing your data and taking advantage of nested and repeated fields. Nested and repeated fields can maintain relationships without the performance impact of preserving a relational (normalized) schema.

The storage savings from normalized data are less of a concern in modern systems. Increases in storage costs are worth the performance gains from denormalizing data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots so execution can be done in parallel.

If you need to maintain relationships while denormalizing your data, use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.

For example, denormalizing an orders schema without using nested and repeated fields may require you to group by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less performant than denormalizing the data using nested and repeated fields.

Note: In some circumstances, denormalizing your data and using nested and repeated fields may not result in increased performance.

You can read more in the Denormalize data whenever possible section of the BigQuery docs.

Finally: BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.

Below is an example of producing a denormalized table out of the initial three normalized tables in your question:

#standardSQL
SELECT ANY_VALUE(c).*,
  ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)

This will produce a table with the schema below.

Obviously, this is a more customer-focused schema. Depending on your needs, you can similarly create a product-centric one, or actually both, and use the appropriate one based on the use case.

(screenshot: schema of the resulting table)

You can test and play with the above using dummy data, as in the example below:

#standardSQL
WITH `project.dataset.customers` AS (
  SELECT 1 customer_id, 'country 1' country, 'city 1' city, 'street 1' street, 1 house_number UNION ALL
  SELECT 2, 'country 1', 'city 2', 'street 2', 2 UNION ALL
  SELECT 3, 'country 1', 'city 3', 'street 3', 3 UNION ALL
  SELECT 4, 'country 2', 'city 4', 'street 4', 4 UNION ALL
  SELECT 5, 'country 2', 'city 5', 'street 5', 5 
), `project.dataset.products` AS (
  SELECT 1 product_id, 'product 1' product_name, 'color 1' product_color, 'origin 1' product_origin UNION ALL
  SELECT 2, 'product 2', 'color 2', 'origin 2' UNION ALL
  SELECT 3, 'product 3', 'color 3', 'origin 3' UNION ALL
  SELECT 4, 'product 4', 'color 4', 'origin 4' 
), `project.dataset.storage` AS (
  SELECT 1 product_id, 1 customer_id, 'building 1' product_storage_building UNION ALL
  SELECT 2, 1, 'building 1' UNION ALL
  SELECT 3, 1, 'building 1' UNION ALL
  SELECT 2, 2, 'building 2' UNION ALL
  SELECT 3, 2, 'building 3' UNION ALL
  SELECT 4, 2, 'building 3' UNION ALL
  SELECT 1, 3, 'building 1' UNION ALL
  SELECT 3, 3, 'building 1' 
)
SELECT ANY_VALUE(c).*,
  ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)    

with this output:

(screenshot: query output)

Upvotes: 2

Bill Karwin

Reputation: 562611

Yes, you're showing one type of denormalization.

There are three types of denormalization:

  • Join rows from different tables, so you don't have to use queries with JOIN.
  • Perform aggregate calculations like SUM() or COUNT() or MAX() or others, so you don't have to use queries with GROUP BY.
  • Pre-calculate expensive expressions, so you don't have to use queries with complex expressions in the select-list.
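
For instance, the second type might look like this, using a hypothetical pre-aggregated summary table built from the question's Table_2:

```sql
-- Type 2: pre-aggregate, so readers skip GROUP BY at query time.
CREATE TABLE Building_Product_Counts AS
SELECT product_storage_building,
       COUNT(*) AS product_count
FROM Table_2
GROUP BY product_storage_building;

-- Readers now do a plain lookup instead of an aggregation:
SELECT product_count
FROM Building_Product_Counts
WHERE product_storage_building = 'building 1';
```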

You're showing an example of the first type. At least you can avoid one of the two joins you intend to do.

Why not make the denormalized table store the result of joining all three tables?
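
A sketch of that, reusing the tables and column names from the question:

```sql
-- One denormalized table covering all three sources,
-- so no joins remain at query time.
CREATE OR REPLACE TABLE Denormalized_All AS
SELECT t1.customer_id,
       t1.country, t1.city, t1.street, t1.house_number,
       t2.product_id, t2.product_storage_building,
       t3.product_name, t3.product_color, t3.product_origin
FROM Table_1 AS t1
LEFT JOIN Table_2 AS t2 ON t2.customer_id = t1.customer_id
LEFT JOIN Table_3 AS t3 ON t3.product_id = t2.product_id;
```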

What's the downside of using denormalization? You are now storing data redundantly: once in the normalized tables, and a copy in the denormalized table. Suppose you get into work tomorrow and find that the data in these different tables doesn't exactly match up. What happened?

  • Maybe someone inserted a row to the normalized tables without adding the corresponding data to the denormalized table.
  • Maybe someone deleted a row from the normalized tables, without deleting the corresponding row from the denormalized table.
  • Maybe someone inserted or deleted a row in the denormalized table, without the corresponding change in the normalized table.

How can you tell what happened? Which table is "correct"? This is the risk of denormalization.

Upvotes: 11

Gary Kephart

Reputation: 4994

Yes, that's the basics of normalization: providing a separate table for repeated data and using a foreign key to reference the new table's primary key. For querying, I would probably use CREATE VIEW instead of CREATE TABLE; in general, views are a better fit than extra tables for fetching read-only data. I would create a view like this:

CREATE VIEW view_2 AS 
SELECT ...
FROM table_2 t2
LEFT JOIN table_1 t1 ON t1.customer_id = t2.customer_id
LEFT JOIN table_3 t3 ON t3.product_id = t2.product_id;

Upvotes: 0
