Blub
Blub

Reputation: 13614

SQL performance when querying tables

I'm new to SQL and I don't understand performance implications. It seems like SQL databases store everything in a single place. Doesn't this mean that tables grow extremely large, very quickly? Won't this hurt performance?

Example Stackoverflow model, but with threaded comments:

CREATE TABLE t_users (
   name varchar(80) primary key,
   email varchar(80)

);

CREATE TABLE t_posts (
   id varchar(80) primary key,
   userid varchar(80) references t_users(name),
   title varchar(80),
   description text,
   topic varchar(80),
   path text

);

Is this a valid design? All posts of every user ever are stored in the same table... So if I want to query for all comments that have topic "programming", it would need to look through every single post, even through the posts that have different topics because they are all stored in the same table....this also means that if I make more complicated queries, they will exponentially grow slower the bigger my table on disk is. Wouldn't it be better to split every single post into a new table?

Upvotes: 0

Views: 146

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269603

The real answer to your question first two questions are "yes" and "no". "Yes", tables do grow quickly. And "no" this generally doesn't hurt performance.

Although you can think of a table as a single file that you scan to find results, SQL does much more than this. Almost all databases manage something called a page table, which means the following:

  • Once the table is read, if it can fit in memory it will stay there until it needs to move out. This greatly speeds further queries on the table.
  • Loading the page table generally operates asynchronously from processing.

These two points, by themselves, are pretty hard (but not impossible) to implement in most programming languages. In addition, SQL gives you indexes and table partitions which can further speed up processing of a single table. And finally, most versions of SQL support multiple processors/threads for almost all their functions.

When it comes to combining results from multiple tables or aggregating results, a database has many person-years of effort in algorithms for doing joins and aggregations. It is unlikely that you would do better in this situation.

And, these capabilities don't touch on many other capabilities of relational databases, such as enforcing consistency in the data, backing up and restoring data, guaranteeing modifications, and so on.

As for your table schema, it seems that you are getting started and it is fine. In general, I would recommend that your primary keys be named after the table and be integers. Your first table would then be:

CREATE TABLE users (
    UserId int primary key,
    name varchar(80),
    email varchar(80)
); 

And the UserId in the second table be declared an int. As a hint to why this is better, users may want to change their names from time to time.

Upvotes: 3

Konerak
Konerak

Reputation: 39763

The design is quasi-valid, but not completely:

  • Your t_users would be better of having an autoincrement unsigned int ID column. (A primary key on a name is almost ALWAYS a bad idea. People change names. People have the same names. Even countries change names sometimes! A numeric is almost always the best choice!)
  • That way, your t_posts can refer to that userID. Joins are now blazing fast.
  • You can even enforce the referential integrity: no post can be made by users that do not exist. Or when a user is deleted, the posts are deleted too.
  • Your t_posts has an ID primary key column (good!), but it's varchar (bad!). INT is better. BIGINT if you need it.

You will find your posts later might have multiple topics (stackoverflow "tags"). Don't put them CSV in a varchar field. Create a new table "topics" with ID, description, and a linking table "posts_to_topic" that links each post to one or more topics.

Indexes

What you need to read up on is indexes. If you want to query for all comments that have topic "programming", you'd usually have an index on the column "topic varchar(80)". This index is small (consider it a seperate table: it contains the indexed column(s) and the primary key), so your (R)DBMS can search it very quickly (tree-structure) and fetch all the primary keys it needs. Then, depending on what you select, the DBMS sends you the information:

  • if you only need columns already in the index, it can fetch everything from the index.
  • if you need other columns, it will only now actually access the t_posts table, and use the primary keys from the indexes to access the rows.

Simplification

I lied. In the last paragraph, I made it all much simpler that it really is. There is an optimiser that will look at the query and determine what indexes can be used. It will check the indexes - depending on the cardinality, table size, columns it might use it, or decide to scan the table anyway. If your table has variable row lengths, fetching the X-th row is much slower than when all rows have the same length (no VARCHAR). And all that depends on what (R)DBMS (or in MySQL, even on what storage engine) you use.

But read about indexes first, on the what, the why, and later the how. After that, you can study the exceptions deeper.

Multiple tables for the same data

This is a very frequently made beginner mistake, and they go both ways:

  • "We can put everything in one table, if we add a column "TYPE" and make everything varchar!"
  • "We can just have 10000 tables in our database, table_1, table_2, etc!"

Reading about indexes will tell you why this technically is a bad idea, but it's also less elegant on a logical scale: one table is meant to represent one entity (Books. Users. Posts. Pages) - splitting those will result in some very ugly queries. And if you ask someone why they are doing this, the reason is often "for speed", while an extra index on their decision-column would have had the same effect.

Think about it: if you make a post title for each user, write the query that lists the 10 most used topics, and how many posts each of those has. You will have to name every table!

Upvotes: 3

Adam V
Adam V

Reputation: 6356

Wouldn't it be better to split every single post into a new table?

No.

If your posts are in different tables, then you have to query hundreds, even thousands of different tables to find all the data.

Not to mention, in addition to the data (which would be the same size, no matter which table it's in), you have the overhead of the table itself.

Upvotes: 1

Related Questions