Guilherme Matuella

Reputation: 2273

Relational database (RDBMS) denormalized data

I believe this question is not specific to MySQL - which is the database I'm using - but is rather one about best practices.

Up until now, my problems could be solved by creating tables and querying them (sometimes JOINing here and there). But there is something I'm doing that doesn't feel right, and it bothers me whenever I need denormalized data alongside my "common" queries.

Example Use-case

So that I can express myself better, let's create a superficial scenario where:

  • users can purchase products, and we want to list each product together with the number of times it has been purchased.

To solve our use-case, we could define a simple structure made by:

  • a product table - with a product_id primary key;
  • a purchase table - with a product_id foreign key referencing product.

Here is where it doesn't feel right: When we need to retrieve a list of products with the total times it has been purchased, I would create the query:

# There are probably faster queries than this to reach the same output
SELECT
    product.product_id,
    (SELECT COUNT(*) FROM purchase
      WHERE purchase.product_id = product.product_id)
FROM
    product

The origin of my concern is that I've read that COUNT does a full table scan, and it scares me to run the query above at the scale of thousands of purchased products - even though I've created an INDEX on the product_id FK in purchase (MySQL does this by default for foreign keys).
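One way to check that fear against reality is to ask MySQL for the execution plan. This is a sketch; with the FK index in place, the plan for each correlated subquery should show an index lookup (`ref` on the purchase index) rather than a full table scan (`ALL`):

```sql
-- Sketch: inspect how MySQL executes the correlated COUNT.
EXPLAIN
SELECT product.product_id,
       (SELECT COUNT(*) FROM purchase
         WHERE purchase.product_id = product.product_id) AS purchase_count
FROM product;
```

If the purchase row for the subquery shows `type: ref` with the FK index under `key`, the COUNT is only scanning the matching index entries per product, not the whole table.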


Possible solutions

My knowledge of relational databases is pretty shallow, so I'm kind of lost when comparing the (plausible) alternatives for these kinds of problems. So as not to say that I haven't done my homework (searching before asking), I've found it plausible to:

Create Transactions:

When INSERTing a new purchase, it must always be inside a transaction that also updates the product table with the purchase.product_id.

Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
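A minimal sketch of that transaction, assuming a hypothetical bought_amount counter column added to product (it is not in the original schema):

```sql
-- Insert a purchase and bump the denormalized counter atomically.
START TRANSACTION;

INSERT INTO purchase (product_id)  -- plus whatever other columns purchase has
VALUES (42);

UPDATE product
SET bought_amount = bought_amount + 1
WHERE product_id = 42;

COMMIT;
```

If either statement fails, a ROLLBACK leaves both tables untouched - but, as noted above, nothing stops someone from inserting into purchase outside this transaction.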

Create Triggers:

Whenever I insert, delete or update a row in some specific table, I would update my product table with a new value (bought_amount). So the table would become something like:

product (product_id, ..., bought_amount)

Possible problems: Are triggers expensive? Is there a way the insertion succeeds but the trigger doesn't - thus leaving me with an inconsistency?
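A sketch of that trigger pair, again assuming the hypothetical bought_amount column on product:

```sql
-- Keep product.bought_amount in sync with purchase automatically.
DELIMITER //

CREATE TRIGGER purchase_after_insert
AFTER INSERT ON purchase
FOR EACH ROW
BEGIN
    UPDATE product
    SET bought_amount = bought_amount + 1
    WHERE product_id = NEW.product_id;
END//

CREATE TRIGGER purchase_after_delete
AFTER DELETE ON purchase
FOR EACH ROW
BEGIN
    UPDATE product
    SET bought_amount = bought_amount - 1
    WHERE product_id = OLD.product_id;
END//

DELIMITER ;
```

On the consistency worry: in MySQL, if a trigger body raises an error, the triggering statement fails too, so with a transactional engine like InnoDB the row and the counter succeed or fail together.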


Question

Is updating certain tables to store data that constantly changes a plausible approach with RDBMSs? Or is it safer and - in the long term - more beneficial to just keep joining and counting/summing the other occurrences?

I've found a couple of useful questions/answers regarding this matter, but none of them addressed this subject from a wide perspective. Please take into consideration my ignorance about RDBMSs, as I may be suggesting nonsense in my possible solutions.

Upvotes: 3

Views: 308

Answers (3)

Rick James

Reputation: 142298

Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.

--> Build a Stored Procedure that does both steps in a transaction, then force users to go through that.

Possible problems: are triggers expensive? is there a way that the insertion succeeds but the trigger won't - thus leaving me with an inconsistency?

Triggers are not too bad. But, again, I would recommend forcing users through a Stored Procedure that does all the desired steps.

Note: Instead of Stored Procedures, you could have an application that does the necessary steps; then force users to go through the app and give them no direct access to the database.
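A sketch of such a Stored Procedure (names are illustrative, and it assumes the hypothetical bought_amount counter column on product):

```sql
-- One entry point that performs both steps inside a transaction.
DELIMITER //

CREATE PROCEDURE record_purchase(IN p_product_id INT)
BEGIN
    START TRANSACTION;

    INSERT INTO purchase (product_id) VALUES (p_product_id);

    UPDATE product
    SET bought_amount = bought_amount + 1
    WHERE product_id = p_product_id;

    COMMIT;
END//

DELIMITER ;

-- Usage: CALL record_purchase(42);
```

To actually force users through it, revoke direct INSERT privileges on purchase and grant only EXECUTE on the procedure.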

A database is the "source of truth" on the data. It is the "persistent" repository for such. It should not be considered the entire engine for building an application.

As for performance:

  • Summing over a million rows may take a noticeable amount of time.
  • You can easily do a hundred single-row queries (select/insert/update) per second.
  • Please think through numbers like that.

Upvotes: 0

AntC

Reputation: 2806

The usual way to get a count per key is

SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id

You don't need to mention the product table, because all it contains is the key column. Now although that uses COUNT(*), it doesn't need a full table scan for every product_id because the SQL engine is smart enough to see the GROUP BY.

But this produces a different result from your query: products that have never been purchased simply won't appear in my query's output, whereas your query will show them with a count of zero.

Then before you start worrying about implementation and efficiency, what question(s) are you trying to answer? If you want to see all products whether purchased or not, then you must scan the whole product table and look up from that to purchase. I would go

SELECT product.product_id, COALESCE(purch.purchase_count, 0) AS purchase_count
FROM product
LEFT JOIN (SELECT product_id, COUNT(*) AS purchase_count
            FROM purchase
            GROUP BY product_id) AS purch
ON product.product_id = purch.product_id

As regards your wider questions (I'm not sure I fully understand them): in the early days SQL was quite inefficient at this sort of joining and aggregating, and schemas were often denormalised, with repeated columns in multiple tables. SQL engines are now much smarter, so that's no longer necessary. You might still see that old-fashioned practice in older textbooks; I would ignore it and design your schema as normalised as possible.

Upvotes: 2

Gordon Linoff

Reputation: 1269873

This query:

SELECT p.product_id,
      (SELECT COUNT(*)
       FROM purchase pu
       WHERE pu.product_id = p.product_id
      )
FROM product p;

has to scan both product and purchase. I'm not sure why you are emotional about one table scan but not the other.

As for performance, this can take advantage of an index on purchase(product_id). In MySQL, this will probably be faster than the equivalent (left) join version.

You should not worry about performance of such queries until that becomes an issue. If you need to increase performance of such a query, first I would ask: Why? That is a lot of information being returned -- about all products over all time. More typically, I would expect someone to care about one product or a period of time or both. And, those concerns would suggest the development of a datamart.

If performance is an issue, you have many alternatives, such as:

  • Defining a data mart to periodically summarize the data into more efficient structures for such queries.
  • Adding triggers to the database to summarize the data, if the results are needed in real-time.
  • Developing a methodology for maintaining the data that also maintains the summaries, either at the application-level or using stored procedures.
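The first option in the list above - a periodically refreshed summary table - could be sketched like this (all names are illustrative, not from the original schema):

```sql
-- A small summary table, rebuilt periodically (e.g. by a cron job or MySQL EVENT).
CREATE TABLE product_purchase_summary (
    product_id INT PRIMARY KEY,
    purchase_count INT NOT NULL
);

-- Refresh: LEFT JOIN keeps never-purchased products with a count of 0.
REPLACE INTO product_purchase_summary (product_id, purchase_count)
SELECT p.product_id, COUNT(pu.product_id)
FROM product p
LEFT JOIN purchase pu ON pu.product_id = p.product_id
GROUP BY p.product_id;
```

Reads then hit the small, pre-aggregated table; the trade-off is that its numbers are only as fresh as the last refresh.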

What doesn't "feel right" to you is actually the tremendous strength of a relational database (with a reasonable data model). You can keep it up-to-date. And you can query it using a pretty concise language that meets business needs.

Upvotes: 2
