Reputation: 13614
I'm new to SQL and I don't understand performance implications. It seems like SQL databases store everything in a single place. Doesn't this mean that tables grow extremely large, very quickly? Won't this hurt performance?
Example Stackoverflow model, but with threaded comments:
CREATE TABLE t_users (
    name varchar(80) primary key,
    email varchar(80)
);
CREATE TABLE t_posts (
    id varchar(80) primary key,
    userid varchar(80) references t_users(name),
    title varchar(80),
    description text,
    topic varchar(80),
    path text
);
Is this a valid design? All posts of every user ever are stored in the same table... So if I want to query for all comments that have the topic "programming", it would need to look through every single post, even the posts that have different topics, because they are all stored in the same table. This also means that my more complicated queries will grow slower and slower the bigger my table on disk is. Wouldn't it be better to split every single post into a new table?
Upvotes: 0
Views: 146
Reputation: 1269603
The real answers to your first two questions are "yes" and "no". "Yes", tables do grow quickly. And "no", this generally doesn't hurt performance.
Although you can think of a table as a single file that you scan to find results, SQL does much more than this. Almost all databases manage their data in fixed-size pages, which means the following:
- All reads and writes operate on pages held in memory, not on individual rows on disk.
- A page cache keeps recently used pages in memory, so repeated access often needs no disk I/O at all.
These two points, by themselves, are pretty hard (but not impossible) to implement in most programming languages. In addition, SQL gives you indexes and table partitions, which can further speed up processing of a single table. And finally, most database engines support multiple processors/threads for almost all their operations.
When it comes to combining results from multiple tables or aggregating results, a database has many person-years of effort in algorithms for doing joins and aggregations. It is unlikely that you would do better in this situation.
And, these capabilities don't touch on many other capabilities of relational databases, such as enforcing consistency in the data, backing up and restoring data, guaranteeing modifications, and so on.
As for your table schema, it is fine for getting started. In general, I would recommend that your primary keys be integers named after the table. Your first table would then be:
CREATE TABLE users (
    UserId int primary key,
    name varchar(80),
    email varchar(80)
);
And the UserId in the second table should be declared an int as well. As a hint to why this is better: users may want to change their names from time to time.
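A sketch of what the second table might look like under this convention (the column names besides UserId are carried over from the question; PostId is an assumed name):

```sql
-- Hypothetical revision of t_posts following the advice above:
-- integer surrogate keys, with the foreign key referencing users(UserId).
CREATE TABLE posts (
    PostId int primary key,
    UserId int references users(UserId),
    title varchar(80),
    description text,
    topic varchar(80),
    path text
);
```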
Upvotes: 3
Reputation: 39763
The design is quasi-valid, but not completely:
- t_users would be better off having an autoincrement unsigned int ID column. (A primary key on a name is almost ALWAYS a bad idea. People change names. People have the same names. Even countries change names sometimes! A numeric key is almost always the best choice!)
- t_posts can then refer to that user ID. Joins are now blazing fast.
- t_posts has an ID primary key column (good!), but it's varchar (bad!). INT is better, BIGINT if you need it.
- You will find later that your posts might have multiple topics (stackoverflow "tags"). Don't put them as CSV in a varchar field. Create a new table "topics" with an ID and a description, and a linking table "posts_to_topic" that links each post to one or more topics.
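A sketch of that many-to-many structure (table names are taken from the answer; the column details are assumptions):

```sql
-- Topics live in their own table...
CREATE TABLE topics (
    id int primary key,
    description varchar(80)
);

-- ...and a linking table maps each post to one or more topics.
-- The composite primary key prevents duplicate post/topic pairs.
CREATE TABLE posts_to_topic (
    post_id int references t_posts(id),
    topic_id int references topics(id),
    primary key (post_id, topic_id)
);
```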
What you need to read up on is indexes. If you want to query for all comments that have the topic "programming", you'd usually have an index on the column "topic varchar(80)". This index is small (consider it a separate table: it contains the indexed column(s) and the primary key), so your (R)DBMS can search it very quickly (tree structure) and fetch all the primary keys it needs. Then, depending on what you select, the DBMS looks up those rows and sends you the information.
I lied. In the last paragraph, I made it all much simpler than it really is. There is an optimiser that will look at the query and determine which indexes can be used. It will check the indexes and, depending on the cardinality, the table size, and the columns involved, it might use one, or decide to scan the table anyway. If your table has variable row lengths, fetching the X-th row is much slower than when all rows have the same length (no VARCHAR). And all of that depends on which (R)DBMS (or in MySQL, even which storage engine) you use.
But read about indexes first, on the what, the why, and later the how. After that, you can study the exceptions deeper.
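A minimal sketch of the index described above (exact syntax and the index name vary by RDBMS; idx_posts_topic is an assumed name):

```sql
-- An index on the topic column lets the DBMS find matching
-- primary keys via a tree search instead of scanning every row.
CREATE INDEX idx_posts_topic ON t_posts (topic);

-- This query can now be answered via the index:
SELECT id, title
FROM t_posts
WHERE topic = 'programming';
```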
This is a very frequently made beginner mistake, and it goes both ways: cramming values that deserve their own table into a single column (the CSV topics above), and splitting one entity over many tables (a posts table per user).
Reading about indexes will tell you why the latter is technically a bad idea, but it's also less elegant on a logical level: one table is meant to represent one entity (books, users, posts, pages), and splitting it up will result in some very ugly queries. And if you ask someone why they are doing this, the reason is often "for speed", while an extra index on the column they split on would have had the same effect.
Think about it: if you make a post table for each user, try writing the query that lists the 10 most used topics and how many posts each of those has. You would have to name every table!
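With a single posts table, that query is trivial; with one table per user it would need a UNION over every user's table. A sketch of the single-table version (LIMIT syntax varies by RDBMS):

```sql
-- Top 10 topics by post count, straight off one table.
SELECT topic, COUNT(*) AS post_count
FROM t_posts
GROUP BY topic
ORDER BY post_count DESC
LIMIT 10;
```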
Upvotes: 3
Reputation: 6356
Wouldn't it be better to split every single post into a new table?
No.
If your posts are in different tables, then you have to query hundreds, even thousands of different tables to find all the data.
Not to mention, in addition to the data (which would be the same size, no matter which table it's in), you have the overhead of the table itself.
Upvotes: 1