Reputation: 2273
I believe this question isn't specific to MySQL - which is the database I'm using - but is about best practices in general.
Up until now, my problems could be solved by creating tables and querying them (sometimes JOINing here and there). But there is something I'm doing that doesn't feel right, and it nags at me whenever I need denormalized data alongside my "common" queries.
So that I can express myself better, let's create a superficial scenario where:

- a user can buy a product, generating a purchase (let's ignore the fact that the purchase can only have a single product);
- we need to list products with the total amount of times each has been purchased.

To solve our use case, we could define a simple structure made up of:
- product table: product_id [INT PK]
- user table: user_id [INT PK]
- purchase table: purchase_id [INT PK], product_id [INT FK NOT NULL], user_id [INT FK NOT NULL]

Here is where it doesn't feel right: when we need to retrieve the list of products with the total number of times each has been purchased, I would write the query:
# There are probably faster queries than this to reach the same output
SELECT
    product.product_id,
    (SELECT COUNT(*)
     FROM purchase
     WHERE purchase.product_id = product.product_id)
FROM
    product
The origin of my concern is that I've read that COUNT does a full table scan, and it scares me to run the query above once it scales to thousands of products being purchased - even though I've created an INDEX on the product_id FK in purchase (MySQL does this by default).
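One way to check this, rather than guessing, is to ask MySQL for the query plan with EXPLAIN. Below is a minimal sketch of the schema described above (names and types assumed) plus the EXPLAIN call:

```sql
-- Minimal sketch of the schema (names and types assumed)
CREATE TABLE user (
    user_id INT PRIMARY KEY
);

CREATE TABLE product (
    product_id INT PRIMARY KEY
);

CREATE TABLE purchase (
    purchase_id INT PRIMARY KEY,
    product_id  INT NOT NULL,
    user_id     INT NOT NULL,
    FOREIGN KEY (product_id) REFERENCES product (product_id),
    FOREIGN KEY (user_id)    REFERENCES user (user_id)
);

-- The plan shows whether the correlated subquery uses the FK index
-- on purchase.product_id or falls back to a full table scan.
EXPLAIN
SELECT product.product_id,
       (SELECT COUNT(*)
        FROM purchase
        WHERE purchase.product_id = product.product_id)
FROM product;
```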
My knowledge of relational databases is pretty shallow, so I'm somewhat lost comparing the (plausible) alternatives for this kind of problem. To show that I have done my homework (searching before asking), here is what I've found plausible:
When INSERTing a new purchase, always do it inside a transaction that also updates the product table for the purchase.product_id.

Possible problems: human error. Someone might manually insert a purchase without using the transaction and BAM - we have an inconsistency.
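As a sketch, that transaction could look like the following (placeholder ids; this assumes product has been extended with a bought_amount counter column, as in the second option below):

```sql
START TRANSACTION;

INSERT INTO purchase (purchase_id, product_id, user_id)
VALUES (1001, 42, 7);  -- placeholder values

-- Keep the denormalized counter in step with the new row
UPDATE product
SET bought_amount = bought_amount + 1
WHERE product_id = 42;

COMMIT;
```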
Whenever I insert, delete or update a row in some specific table, a trigger would update my product table's counter column (bought_amount). So the table would become:

- product table: product_id [INT PK], bought_amount [INT NOT NULL]

Possible problems: are triggers expensive? Is there a way the insertion could succeed but the trigger fail - thus leaving me with an inconsistency?
Is updating certain tables to store data that constantly changes a plausible approach with RDBMSs? Or is it safer and - in the long term - more beneficial to just keep joining and counting/summing the occurrences?

I've found a couple of useful questions/answers on this matter, but none of them addressed the subject from a wide perspective. Please take my ignorance of RDBMSs into consideration, as I may be suggesting nonsense in the possible solutions above.
Upvotes: 3
Views: 308
Reputation: 142298
Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
--> Build a Stored Procedure that does both steps in a transaction, then force users to go through that.
Possible problems: are triggers expensive? is there a way that the insertion succeeds but the trigger won't - thus leaving me with an inconsistency?
Triggers are not too bad. But, again, I would recommend forcing users through a Stored Procedure that does all the desired steps.
Note: Instead of Stored Procedures, you could have an application that does the necessary steps; then force users to go through the app and give them no direct access to the database.
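A sketch of such a procedure (hypothetical name and parameters; assumes purchase.purchase_id is AUTO_INCREMENT and product has a bought_amount counter column):

```sql
DELIMITER //

CREATE PROCEDURE record_purchase(IN p_product_id INT, IN p_user_id INT)
BEGIN
    START TRANSACTION;

    INSERT INTO purchase (product_id, user_id)
    VALUES (p_product_id, p_user_id);

    UPDATE product
    SET bought_amount = bought_amount + 1
    WHERE product_id = p_product_id;

    COMMIT;
END //

DELIMITER ;

-- Application code would then only ever run:
-- CALL record_purchase(42, 7);
```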
A database is the "source of truth" on the data. It is the "persistent" repository for such. It should not be considered the entire engine for building an application.
As for performance:
Upvotes: 0
Reputation: 2806
The usual way to get a count per key is
SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id
You don't need to mention the product table, because all it contains is the key column. And although this uses COUNT(*), it doesn't need a full table scan for every product_id, because the SQL engine is smart enough to see the GROUP BY.

But this produces a different result from your query: products that have never been purchased simply won't show up in my query's output, whereas your query shows them with a count of zero.
So before you start worrying about implementation and efficiency: what question(s) are you trying to answer? If you want to see all products, whether purchased or not, then you must scan the whole product table and look up each row in purchase. I would go:
SELECT product.product_id, COALESCE(purch.count, 0) AS count
FROM product
LEFT JOIN (SELECT product_id, COUNT(*) AS count
           FROM purchase
           GROUP BY product_id) AS purch
ON product.product_id = purch.product_id

(LEFT JOIN rather than a bare OUTER JOIN, which MySQL rejects; COALESCE turns the NULL for never-purchased products into 0.)
As regards your wider questions (I'm not sure I fully understand them): in the early days SQL was quite inefficient at this sort of joining and aggregating, and schemas were often denormalised, with repeated columns in multiple tables. SQL engines are now much smarter, so that's not necessary. You might still see that old-fashioned practice in older textbooks. I would ignore it and design your schema as normalised as possible.
Upvotes: 2
Reputation: 1269873
This query:
SELECT p.product_id,
(SELECT COUNT(*)
FROM purchase pu
WHERE pu.product_id = p.product_id
)
FROM product p;
has to scan both product and purchase. I'm not sure why you are emotional about one table scan but not the other.
As for performance, this can take advantage of an index on purchase(product_id)
. In MySQL, this will probably be faster than the equivalent (left) join version.
You should not worry about performance of such queries until that becomes an issue. If you need to increase performance of such a query, first I would ask: Why? That is a lot of information being returned -- about all products over all time. More typically, I would expect someone to care about one product or a period of time or both. And, those concerns would suggest the development of a datamart.
If performance is an issue, you have many alternatives, such as:
What doesn't "feel right" to you is actually the tremendous strength of a relational database (with a reasonable data model). You can keep it up-to-date. And you can query it using a pretty concise language that meets business needs.
Upvotes: 2