Reputation: 410
I've been reading about denormalization for the last 20 minutes but can't find a concise example with code.
Is this what denormalization is?
1. We have a normalized database:

Table_1:
    customer_id (Primary Key)
    country
    city
    street
    house_number

Table_2:
    product_id (Primary Key)
    customer_id (Foreign Key)
    product_storage_building

Table_3:
    product_id (Foreign Key)
    product_name
    product_color
    product_origin
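Roughly, as DDL (column types here are just assumed for illustration):

-- Assumed types, only to make the normalized layout concrete
CREATE TABLE Table_1 (
  customer_id  INT64,   -- primary key
  country      STRING,
  city         STRING,
  street       STRING,
  house_number INT64
);

CREATE TABLE Table_2 (
  product_id   INT64,   -- primary key
  customer_id  INT64,   -- foreign key -> Table_1.customer_id
  product_storage_building STRING
);

CREATE TABLE Table_3 (
  product_id     INT64, -- foreign key -> Table_2.product_id
  product_name   STRING,
  product_color  STRING,
  product_origin STRING
);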
However, let's say joining all three tables takes far too long to run:
SELECT a.*, b.*, c.*
FROM
TABLE_1 AS a
LEFT JOIN TABLE_2 AS b
ON a.customer_id = b.customer_id
LEFT JOIN TABLE_3 AS c
ON b.product_id = c.product_id
So I create a new table out of Table_1 and Table_2:
CREATE OR REPLACE TABLE Denormalized_Data AS
(
SELECT Table_1.customer_id,
country,
city,
street,
house_number,
product_id,
product_storage_building
FROM Table_1
LEFT JOIN Table_2
ON Table_1.customer_id = Table_2.customer_id
)
Then join to Table_3 as follows
SELECT customer_id,
country,
city,
street,
house_number,
product_storage_building,
Denormalized_Data.product_id,
product_name,
product_color
FROM Denormalized_Data
LEFT JOIN Table_3
ON Denormalized_Data.product_id = Table_3.product_id
Now this will make my query run faster - can the whole process above be described as denormalization?
Thanks
Upvotes: 6
Views: 27846
Reputation: 3618
Consider the image below. The top contains several distinct tables that encapsulate logically separate bits of info. The bottom shows the results of those tables joined together. This is denormalization.
In the case of BigQuery, and especially when using BQ as the backend for a BI platform, denormalized data provides a quicker user experience because the joins don't have to run when a user hits 'run'.
If you leave the tables as is and a user needs several of the fields, you might end up doing up to 7 joins and then doing aggregations (sums, counts, etc.). However, if you do all 7 joins once and store the result in 1 table, then the user would be querying only 1 table and doing only aggregations. This is the power of BigQuery: it is scalable, so grouping and aggregating over huge columns of data is relatively 'easy' compared to joins, making the end user experience much faster.
People/companies that go in this direction typically do so in ETL processes (commonly overnight), so the joins only have to happen 1 time (when users typically aren't using the database); then during the day, users and BI tools are just aggregating and slicing data without the joins! This does result in 'redundant' data and does incur extra storage costs, but it is often worth it for the downstream user experience.
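A rough sketch of that overnight pattern, using the table names from the question (the Denormalized_Reporting name is made up):

-- Overnight ETL step, run once: do all the joins a single time and store the result
CREATE OR REPLACE TABLE Denormalized_Reporting AS
SELECT a.customer_id,
       a.country,
       a.city,
       a.street,
       a.house_number,
       b.product_id,
       b.product_storage_building,
       c.product_name,
       c.product_color,
       c.product_origin
FROM Table_1 AS a
LEFT JOIN Table_2 AS b ON a.customer_id = b.customer_id
LEFT JOIN Table_3 AS c ON b.product_id = c.product_id;

-- During the day, users and BI tools only aggregate and slice, with no joins
SELECT country,
       product_storage_building,
       COUNT(*) AS products
FROM Denormalized_Reporting
GROUP BY country, product_storage_building;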
Upvotes: 9
Reputation: 173046
Below is a BigQuery-specific answer!
BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, you can improve performance by denormalizing your data and taking advantage of nested and repeated fields. Nested and repeated fields can maintain relationships without the performance impact of preserving a relational (normalized) schema.
The storage savings from normalized data are less of a concern in modern systems. Increases in storage costs are worth the performance gains from denormalizing data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots so execution can be done in parallel.
If you need to maintain relationships while denormalizing your data, use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.
For example, denormalizing an orders schema without using nested and repeated fields may require you to group by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less performant than denormalizing the data using nested and repeated fields.
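As a hedged illustration with a hypothetical orders schema (table and column names are made up): a completely flat layout repeats the order columns on every line item, so getting back to one row per order means a shuffle-heavy GROUP BY, while a nested layout keeps each order in a single row.

-- Flat denormalization: one row per line item, so order totals need grouping by order_id
SELECT order_id,
       ANY_VALUE(order_date) AS order_date,
       SUM(line_amount) AS order_total
FROM `project.dataset.orders_flat`
GROUP BY order_id;

-- Nested denormalization: one row per order with a repeated items field,
-- so the total is computed inside each row and no cross-row grouping is needed
SELECT order_id,
       order_date,
       (SELECT SUM(i.line_amount) FROM UNNEST(items) AS i) AS order_total
FROM `project.dataset.orders_nested`;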
Note: In some circumstances, denormalizing your data and using nested and repeated fields may not result in increased performance.
You can see more in the Denormalize data whenever possible section of the BigQuery docs.
Finally: BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.
Below is an example of producing a denormalized table out of the initial three normalized tables in your question:
#standardSQL
SELECT ANY_VALUE(c).*,
ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)
This will produce a table with the schema below.
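Roughly, that schema looks like the following (written as DDL just for illustration; the table name is a placeholder and the types follow the dummy data further down):

-- Schema implied by the query above; name and types are assumed
CREATE TABLE `project.dataset.customer_centric` (
  customer_id  INT64,
  country      STRING,
  city         STRING,
  street       STRING,
  house_number INT64,
  products ARRAY<STRUCT<
    product_id               INT64,
    product_name             STRING,
    product_color            STRING,
    product_origin           STRING,
    product_storage_building STRING
  >>
);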
Obviously, this is a more customer-focused schema. Depending on your needs, you can similarly create a product-centric one, or actually both, and use the appropriate one based on the use case.
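For instance, a product-centric version could be sketched by flipping the aggregation, mirroring the query above (same placeholder table names):

#standardSQL
SELECT ANY_VALUE(p).*,
  ARRAY_AGG((SELECT AS STRUCT c.*, s.product_storage_building)) customers
FROM `project.dataset.products` p
LEFT JOIN `project.dataset.storage` s USING (product_id)
LEFT JOIN `project.dataset.customers` c USING (customer_id)
GROUP BY FORMAT('%t', p)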
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.customers` AS (
SELECT 1 customer_id, 'country 1' country, 'city 1' city, 'street 1' street, 1 house_number UNION ALL
SELECT 2, 'country 1', 'city 2', 'street 2', 2 UNION ALL
SELECT 3, 'country 1', 'city 3', 'street 3', 3 UNION ALL
SELECT 4, 'country 2', 'city 4', 'street 4', 4 UNION ALL
SELECT 5, 'country 2', 'city 5', 'street 5', 5
), `project.dataset.products` AS (
SELECT 1 product_id, 'product 1' product_name, 'color 1' product_color, 'origin 1' product_origin UNION ALL
SELECT 2, 'product 2', 'color 2', 'origin 2' UNION ALL
SELECT 3, 'product 3', 'color 3', 'origin 3' UNION ALL
SELECT 4, 'product 4', 'color 4', 'origin 4'
), `project.dataset.storage` AS (
SELECT 1 product_id, 1 customer_id, 'building 1' product_storage_building UNION ALL
SELECT 2, 1, 'building 1' UNION ALL
SELECT 3, 1, 'building 1' UNION ALL
SELECT 2, 2, 'building 2' UNION ALL
SELECT 3, 2, 'building 3' UNION ALL
SELECT 4, 2, 'building 3' UNION ALL
SELECT 1, 3, 'building 1' UNION ALL
SELECT 3, 3, 'building 1'
)
SELECT ANY_VALUE(c).*,
ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)
with output of one row per customer, where the repeated products field holds that customer's products and their storage buildings
Upvotes: 2
Reputation: 562611
Yes, you're showing one type of denormalization.
There are three types of denormalization:

- Storing the result of a JOIN, so you don't have to use queries with JOIN
- Storing pre-calculated SUM() or COUNT() or MAX() or others, so you don't have to use queries with GROUP BY

You're showing an example of the first type. At least you can avoid one of the two joins you intend to do.
Why not make the denormalized table store the result of joining all three tables?
What's the downside of using denormalization? You are now storing data redundantly: once in the normalized tables, and a copy in the denormalized table. Suppose you get into work tomorrow and find that the data in these different tables doesn't exactly match up. What happened?
How can you tell what happened? Which table is "correct"? This is the risk of denormalization.
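As a hedged illustration of that risk, one way to spot drift is to diff the denormalized copy against a fresh join of the normalized sources (using the question's table names):

-- Rows in the denormalized copy that no longer match a fresh join of the sources
-- (swap the two operands to find rows the copy is missing)
SELECT customer_id, country, city, street, house_number,
       product_id, product_storage_building
FROM Denormalized_Data
EXCEPT DISTINCT
SELECT a.customer_id, a.country, a.city, a.street, a.house_number,
       b.product_id, b.product_storage_building
FROM Table_1 AS a
LEFT JOIN Table_2 AS b ON a.customer_id = b.customer_id;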
Upvotes: 11
Reputation: 4994
Yes, that's the basics of normalization: providing a separate table for repeated data and using a foreign key to reference the new table's primary key. For querying, I would probably use CREATE VIEW instead of CREATE TABLE, since in general using views instead of tables to fetch read-only data is better. I would create a view like this:
CREATE VIEW view_2 AS
SELECT ...
FROM table_2 t2
LEFT JOIN table_1 t1 ON t1.customer_id = t2.customer_id
LEFT JOIN table_3 t3 ON t3.product_id = t2.product_id;
Upvotes: 0