Slim Shady

Reputation: 1085

Lots of small MySQL tables or one big table

I have a forum with properties like follow, voteup, votedown, report, favorite, view, etc. for each thread, answer, and comment.

Which approach will be faster and better performance-wise? I am expecting billions of favorites, views, etc., just like YouTube.

Approach One

Make one big table, counter:

counter_id | user_id | object_id | object_type | property

where object_type is thread, comment, or answer, each with its respective id from the threads, comments, and answers tables,

and property is follow, voteup, votedown, report, etc.
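
For concreteness, the big table might look like this (the column types are my guesses, not settled):

CREATE TABLE counter (
    counter_id  BIGINT NOT NULL AUTO_INCREMENT,
    user_id     INT NOT NULL,
    object_id   INT NOT NULL,
    object_type TINYINT NOT NULL,   -- 1 = thread, 2 = comment, 3 = answer
    property    TINYINT NOT NULL,   -- 1 = follow, 2 = voteup, 3 = votedown, ...
    PRIMARY KEY (counter_id)
);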

Approach Two

Make individual tables for follow, views, report, etc.
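
Again for concreteness (types are my guesses), one narrow table per property, e.g.:

CREATE TABLE follows (
    user_id     INT NOT NULL,
    object_id   INT NOT NULL,
    object_type TINYINT NOT NULL,   -- 1 = thread, 2 = comment, 3 = answer
    PRIMARY KEY (user_id, object_type, object_id)
);

-- ...and likewise a views table, a reports table, and so on.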

Upvotes: 0

Views: 271

Answers (2)

Rick James

Reputation: 142218

Answer, Part 1: Plan on redesigning as you go.

The best advice I can give you is to plan for change. What you design for the first million rows will not work for 30 million. The 30-million design will not survive to a billion. Whatever you do after reading this thread may last you through 30K rows.

Why is this? Well, partly because you will not be able to do it on a single machine. Don't shard your database now, but keep in the back of your mind that you will need to shard it. At that point, much of what worked on a single machine will either not work on multiple machines or will be too slow to work. So you will have to redesign.

Let me point out another aspect of 1 billion rows. Think how fast you have to do INSERTs to grow a table to 1B rows in 1 year. It's over 30 per second. That's not bad, until you factor in the spikes you will get.
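
The arithmetic is easy to check in MySQL itself:

SELECT 1000000000 / (365 * 24 * 60 * 60) AS inserts_per_second;  -- ≈ 31.7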

And what will happen when your second billion won't fit on the disk you have laid out?

Anyone who grows to a billion rows has to learn as he goes. The textbooks don't go there; the manuals don't go there; only the salesmen go there, but they don't stick around after the check clears. Look at YouTube (etc) -- almost nothing is "off the shelf".

And think of how many smart designers you will need to hire to get to 1 billion.

It is painful to add a column to a billion-row table, so (1) plan ahead, and (2) design a way to make changes without major outages.

Answer, Part 2: Some tips

Here are some of my comments on the ideas bounced around, and some tips from someone who has dealt with a billion-row, sharded system (not YouTube, but something similar).

Normalize vs denormalize: My motto: "Normalize, but don't overnormalize." You'll see what I mean after you have done some of it.

One table vs many: Two tables with essentially identical CREATE TABLEs should usually be a single table. (Sharding, of course, violates that.) OTOH, if you need thousands of UPDATE ... SET view_count = view_count + 1 per second, it won't survive to a billion. However, it might survive to a million; then plan for change. One workaround is sketched below.
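
One common way to stretch that further (a sketch of mine, not something this answer prescribes; the objects table and its view_count column are hypothetical): buffer increments in a small staging table and flush them in bulk. RENAME TABLE is atomic in MySQL, so the swap doesn't lose increments mid-flush.

CREATE TABLE view_deltas (
    object_id INT NOT NULL PRIMARY KEY,
    delta     INT NOT NULL
);
CREATE TABLE view_deltas_spare LIKE view_deltas;  -- empty twin for the swap

-- Hot path: a cheap upsert instead of updating the big objects table.
INSERT INTO view_deltas (object_id, delta)
    VALUES (1234, 1)
    ON DUPLICATE KEY UPDATE delta = delta + 1;

-- Flush job, every few seconds: swap in the empty twin, then apply
-- the buffered deltas in one batched UPDATE.
RENAME TABLE view_deltas TO view_deltas_old, view_deltas_spare TO view_deltas;
UPDATE objects o
    JOIN view_deltas_old d USING (object_id)
    SET o.view_count = o.view_count + d.delta;
TRUNCATE TABLE view_deltas_old;
RENAME TABLE view_deltas_old TO view_deltas_spare;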

Minimize the size of datatypes: using a 3-byte MEDIUMINT instead of a 4-byte INT for one column saves a gigabyte per billion rows.

Do not paginate using OFFSET and LIMIT. (I have a blog on a workaround.)
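
The usual workaround, sketched here with hypothetical table and column names, is to remember where you left off and seek by key instead of skipping OFFSET rows:

SELECT thread_id, title
    FROM threads
    ORDER BY thread_id DESC
    LIMIT 10;                      -- page 1

SELECT thread_id, title
    FROM threads
    WHERE thread_id < 9876         -- last thread_id from the previous page
    ORDER BY thread_id DESC
    LIMIT 10;                      -- next page: an index seek, not a scan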

Batch INSERTs where possible.
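
For example, one multi-row statement instead of many single-row ones (hypothetical table) cuts round trips and transaction overhead:

INSERT INTO views (user_id, object_id, object_type)
VALUES
    (1, 10, 2),
    (1, 11, 2),
    (2, 10, 2);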

Use InnoDB; you don't want to wait hours for a REPAIR to finish on a MyISAM table.

The simple task of getting a unique ID for the 'next' item can be a huge problem in a sharded system. Wait until you are closer to needing sharding before redesigning that part. Do not use UUIDs for a billion-row table; they will perform poorly. So don't even think about UUIDs now; you would have to throw them away.
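
For when that day comes (a sketch only, not a reason to redesign now): one well-known MySQL technique is interleaving AUTO_INCREMENT ranges per shard.

SET GLOBAL auto_increment_increment = 4;  -- total number of shards
SET GLOBAL auto_increment_offset    = 2;  -- this shard's number (1-based)
-- Shard 2 of 4 now hands out IDs 2, 6, 10, ... with no cross-shard collisions.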

Long before you hit 1 billion, you will have nightmares about the one machine crashing. Think about replication, HA, etc, early. It is painful to set up such after you have big tables.

Upvotes: 1

Philip Couling

Reputation: 14883

There is no single answer to this; it's quite subjective.

Most commonly it's best to consider the use cases for your design. Think carefully about what these fields will be used for before you add them to any table, and don't assume that you have to add a numeric primary key ("ID") to every table. A table for tracking follows is just fine with only the fields user id | object id | object type, with all three fields contained in the primary key.

It's unlikely your code will ever face performance constraints like YouTube's, or even Stack Overflow's. If it does, you will most likely have remodelled the database by then.

However, for the sake of the exercise, consider where and how the data is to be used...

I would have separate tables as follows:

Follow: User feeds probably need their own table, as most commonly they get hit from anywhere (a bit like a global inbox). The follow should also have some flag or timestamp to mark changes, so that it's easy to evaluate what has changed since the last time the user was online.

This is because a user needs to see what they've followed as some sort of feed, and others need to see how many people have followed. But others don't need to see who else has followed.

Vote up, vote down: That's just one vote with a +/- flag. Do denormalize this: that is, store BOTH a user's individual votes in a table AND a count of votes in a field on the object's table. That way you only ever check a single user's vote (their own) for a page view; the counts are retrieved from the same row containing the content.

Again, a user needs to see what they've up- or down-voted, and you need to check they're not voting twice, but what matters is the final count. So checking an object with a million upvotes should not have to hit a million rows, just one.

Pro tip: Some database engines perform badly if you constantly update rows with large content, so consider a "meta-data" table for all objects, which stores counts such as this. That leaves the metadata free to update frequently even when the content doesn't. A sketch of both pieces follows.
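
Putting the two previous points together, a sketch (table and column names are mine, not fixed):

CREATE TABLE votes (
    user_id     INT NOT NULL,
    object_type INT NOT NULL,
    object_id   INT NOT NULL,
    vote        TINYINT NOT NULL,   -- +1 = up, -1 = down
    PRIMARY KEY (user_id, object_type, object_id)   -- one vote per user per object
);

CREATE TABLE object_meta (
    object_type INT NOT NULL,
    object_id   INT NOT NULL,
    vote_total  INT NOT NULL DEFAULT 0,   -- denormalized running sum
    view_count  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (object_type, object_id)
);

-- Recording an upvote touches both, ideally in one transaction:
INSERT INTO votes (user_id, object_type, object_id, vote)
    VALUES (42, 1, 1234, +1);
UPDATE object_meta
    SET vote_total = vote_total + 1
    WHERE object_type = 1 AND object_id = 1234;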

Favorite: Its own table again: user id | object id | object type. If you want to display the number of favourites to the public, keep a count against the object; don't do a SELECT COUNT(*) on every page view.

View: Why even store this? Keep a count against the object. If you're going to store a history, make sure you put a timestamp against it and purge it regularly; you don't need to store what a user was looking at six months ago.
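
If you do keep a history, the purge can be as simple as (names are mine):

DELETE FROM views
    WHERE viewed_at < NOW() - INTERVAL 6 MONTH
    LIMIT 10000;   -- run repeatedly; chunking keeps locks and undo small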


As a general observation, all of these are separate tables, with the exception of up and down votes, which share one.

You should denormalize the counts to reduce the quantity of data your server needs to access for a page view. Page views should be the fastest thing; any form of update can be a little slower.


Where I mention for favourites and others that they don't need an additional primary key field, what I mean is that they do have a primary key, just not an additional surrogate field. For example, favourites could be:

CREATE TABLE favourites (
    user        INT NOT NULL,   -- the user doing the favouriting
    object_type INT NOT NULL,   -- thread, answer, or comment
    object_id   INT NOT NULL,   -- id within that object's own table
    PRIMARY KEY (user, object_type, object_id)
);

There's simply no reason to have a favorite_id field.

Upvotes: 2
