dingalapadum

Reputation: 2177

One large table or many small ones in database?

Say I want to create a typical todo web app using a DB like PostgreSQL. A user should be able to create todo-lists, and on these lists he should be able to make the actual todo-entries.

I regard the todo-list as an object which has different properties like owner, name, etc., and of course the actual todo-entries, which have their own properties like content, priority, date, etc.

My idea was to create a table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the question which arises is how to store the todo-entries themselves? Of course in an additional table, but should I rather:

1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:

todo-list: id, owner, ...
todo-entries: list.id, content, ...

which would give 2 tables in total. The todo-entries table could get very large, although we know that entries expire, so the table only grows with more usage but not over time. Then we would write something like SELECT * FROM todo-entries WHERE todo-list-id=id, where id is the id of the list we are trying to retrieve.
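To make that concrete, the schema for this first option would be something along these lines (table and column names just for illustration):

```sql
CREATE TABLE todo_lists (
    id      serial PRIMARY KEY,
    owner   text NOT NULL,
    name    text
);

CREATE TABLE todo_entries (
    id       serial PRIMARY KEY,
    list_id  integer NOT NULL REFERENCES todo_lists (id),
    content  text,
    priority integer,
    due_date date
);
```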

OR

2. Create a todo-entries table on a per user basis.

todo-list: id, owner, ...
todo-entries-owner: list.id, content, ...

The number of entries tables depends on the number of users in the system. The query would be something like SELECT * FROM todo-entries-owner. This gives mid-sized tables, depending on how many entries the users make in total.

OR

3. Create one todo-entries table for each todo-list and then store the generated table name in a field of the list. For instance we could use the todo-list's unique id in the table name, like:

todo-list: id, owner, entries-list-name, ...    
todo-entries-id: content, ... //the id part is the id from the todo-list id field. 

In the third case we could potentially have quite a large number of tables, since a user might create many 'short' todo-lists. To retrieve a list we would then simply go along the lines of SELECT * FROM todo-entries-id, where todo-entries-id is either stored in a field of the todo-list or constructed implicitly by concatenating 'todo-entries' with the todo-list's unique id. Btw.: how do I do that, should this be done in js or can it be done in PostgreSQL directly? And very related to this: in the SELECT * FROM <tablename> statement, is it possible to have the value of some field of some other table as <tablename>? Like SELECT * FROM todo-list(id).entries-list-name or so.

The three possibilities go from few large tables to many small ones. My personal feeling is that the second or third solution is better; I think they might scale better. But I'm not quite sure of that, and I would like to know what the 'typical' approach is.

I could go more into depth on what I think of each of the approaches, but to get to the point of my question: which of these approaches is the typical one, and why?


Upvotes: 2

Views: 3307

Answers (1)

Patrick

Reputation: 32179

The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.

Imagine you have 1 million users, with an average of 3 to-do lists each and an average of 5 entries per list.

Scenario 1

In the first scenario you have three tables:

  • todo_users: 1 million records
  • todo_lists: 3 million records
  • todo_entries: 15 million records

Such table sizes are no problem for PostgreSQL, and with the right indexes you will be able to retrieve any data in less than a second. That holds for simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made fewer than 3 todo_lists in the 3-month period with the most todo_entries entered) they will obviously be slower, as they would be in the other scenarios too. The queries are very straightforward:

-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;

-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;

-- Find entries for a to-do list based on its todoid
SELECT * FROM todo_entries WHERE listid = ?;
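The "right indexes" mentioned above would be something like the following (note that PostgreSQL creates indexes automatically only for primary keys and unique constraints, not for the referencing side of foreign keys):

```sql
CREATE INDEX ON todo_users (username);
CREATE INDEX ON todo_lists (userid);
CREATE INDEX ON todo_entries (listid);
```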

You can also combine the three queries into one:

SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;

Use of the LEFT JOINs means that you will also get data for users without lists or lists without entries (but column values will be NULL).

Inserting, updating and deleting records can be done with very similar statements and similarly fast.
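For example (assuming todo_entries has its own id column as primary key):

```sql
-- Add an entry to a list
INSERT INTO todo_entries (listid, content) VALUES (?, ?);

-- Change the content of an entry
UPDATE todo_entries SET content = ? WHERE id = ?;

-- Delete a list together with its entries
DELETE FROM todo_entries WHERE listid = ?;
DELETE FROM todo_lists WHERE id = ?;
```

(If the foreign key on listid is declared with ON DELETE CASCADE, the first DELETE is not even needed.)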

PostgreSQL stores data in "pages" (8kB in size by default) and most pages will be filled, which is a good thing because reading and writing a page are very slow compared to other operations.

Scenario 2

In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.

  • 1 million todo_lists tables with a few records each
  • 1 million todo_entries tables with a few dozen records each

The only practical solution to that is to construct the full table names from a "basename" related to the username or some other persistent authentication data from your web site. So something like this:

username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';

And then you query with those table names. More likely you will need a todo_users table anyway to store personal data, usernames and passwords of your 1 million users.

In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely build your queries in code and then feed them to PostgreSQL, meaning that it cannot reuse query plans. A bigger problem is creating the tables for new users (todo_lists and todo_entries) or deleting obsolete lists or users: this requires the kind of behind-the-scenes housekeeping that you avoid entirely with the first scenario. And the biggest performance penalty is that most pages will hold only a little content, so you waste disk space and lots of time reading and writing those partially filled pages.
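To illustrate that housekeeping: even just creating the tables for a new user requires dynamic SQL, because a plain SQL statement cannot take a table name from a variable. In PostgreSQL that means a PL/pgSQL function along these lines (the table layouts are only a sketch):

```sql
CREATE FUNCTION create_user_tables(uname text) RETURNS void AS $$
BEGIN
    -- format() with %I quotes the constructed identifier safely
    EXECUTE format('CREATE TABLE %I (id serial PRIMARY KEY, name text)',
                   uname || '_lists');
    EXECUTE format('CREATE TABLE %I (listid integer, content text)',
                   uname || '_entries');
END;
$$ LANGUAGE plpgsql;
```

The same EXECUTE trick is needed for every query against those tables. This also answers the question from the original post: SELECT * FROM <tablename> cannot take the table name from a field of another table; that is only possible via dynamic SQL like this.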

Scenario 3

This scenario is even worse than scenario 2. Don't do it, it's madness.

  • 3 million todo_entries tables with a few records each

So...

Stick with option 1. It is your only real option.

Upvotes: 5
