Strae

Reputation: 19465

Best practices to structure a database to be scaling-ready

I know this is a very generic and subjective question, so feel free to vote to close it if it does not meet the Stack Overflow netiquette... but to me it's worth trying ;)

I've never built a high-traffic application until now, so I'm not aware (except for some reading on the web) of scaling practices.

How can I design a database so that, when scaling is needed, I don't have to refactor the database structure or the application code?

I know that development (and optimization) should proceed step by step, optimizing bottlenecks as they appear, and that it's nearly impossible to design the perfect structure when you don't know how many users you'll have or how they'll use the database (e.g. the read/write ratio). I'm just looking for a good base to start from.

What are the best practices for making a structure that's almost ready to be scaled with partitioning and sharding, and which hacks absolutely must be avoided?

Edit: some details about my application:

  1. The application will behave as a multisite application
  2. I'll have a database for each application version (db_0_0_1, db_0_0_2, etc.)
  3. Every 'site' will have a schema inside a database, and a role that can access only its own schemas
  4. Application code will be mostly PHP, with a few things (daemons and maintenance tasks) in Python
  5. The web server will probably be Nginx, with lighttpd or node.js as support for long-polling tasks (e.g. chat)
  6. Caching will be done with memcached (plus APC for things strictly related to the PHP code, as APC can't be used outside PHP)

Upvotes: 6

Views: 2886

Answers (1)

filiprem

Reputation: 7144

The question is really generic, but here are a few tips:

  • Do not use any session variables (pg_backend_pid(), inet_client_addr()) or per-session control statements (SET ROLE, SET SESSION) in application code.

  • Do not use explicit transaction control (BEGIN/COMMIT/SET TRANSACTION) in application code. All such logic should be wrapped in UDFs. This enables stateless, statement-mode pooling, the fastest possible form of DB connection pooling (see the PgBouncer docs and the PostgreSQL wiki for more info).
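A minimal sketch of what that can look like: a PL/pgSQL function body runs atomically inside the single calling statement, so the app never issues BEGIN/COMMIT itself. The `accounts` table and `transfer_credit` function are made-up names for illustration:

```sql
-- Hypothetical example: multi-statement transactional logic moved into a UDF,
-- so the application sends one statement (atomic by itself), which is
-- compatible with statement-mode pooling.
CREATE OR REPLACE FUNCTION transfer_credit(from_id int, to_id int, amount numeric)
RETURNS void AS $$
BEGIN
    UPDATE accounts SET balance = balance - amount WHERE id = from_id;
    UPDATE accounts SET balance = balance + amount WHERE id = to_id;
END;
$$ LANGUAGE plpgsql;

-- Application code runs one statement instead of BEGIN/UPDATE/UPDATE/COMMIT:
-- SELECT transfer_credit(1, 2, 10.00);
```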

  • Encapsulate all App<->DB communication in a well-defined DB API of UDFs - this will let you use PL/Proxy. If doing this for all SELECTs is too hard, do it at least for all data writes (INSERT/UPDATE/DELETE). Example: instead of INSERT INTO users(name) VALUES('Joe') you need SELECT create_user('Joe').
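A sketch of such a write wrapper, assuming a `users` table with a serial `id` column (the table is taken from the example above, the column is an assumption):

```sql
-- Hypothetical write-API function replacing a raw INSERT from the app.
CREATE OR REPLACE FUNCTION create_user(p_name text)
RETURNS integer AS $$
    INSERT INTO users(name) VALUES (p_name) RETURNING id;
$$ LANGUAGE sql;

-- Application code calls:
-- SELECT create_user('Joe');
```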

  • Check your DB schema - is it easy to separate all the data belonging to a given user? (Most probably this will be the partitioning key.) Everything left over is common, shared data that will need to be replicated to all nodes.
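For example (hypothetical tables), a schema where every per-user table carries the user id is easy to partition on `user_id` later, while tables without it are the shared data:

```sql
-- Per-user data: everything reachable via user_id can follow the user
-- to its shard.
CREATE TABLE users    (id serial PRIMARY KEY, name text);
CREATE TABLE orders   (id serial PRIMARY KEY,
                       user_id int REFERENCES users(id), total numeric);
CREATE TABLE messages (id serial PRIMARY KEY,
                       user_id int REFERENCES users(id), body text);

-- Shared data: no user_id, so it would be replicated to every node
-- instead of partitioned.
CREATE TABLE countries (code char(2) PRIMARY KEY, name text);
```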

  • Think about caching before you need it. What will the cache key be? What will the cache timeout be? Will you use memcached?
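Since you mention Python in your stack, here is one way to sketch the "what will the cache key be" question: build keys deterministically so the same query always hits the same cache entry, and hash them because memcached keys are limited to 250 bytes and may not contain whitespace. Function and namespace names are made up:

```python
import hashlib

def cache_key(namespace, **params):
    """Build a deterministic memcached-safe cache key.

    Parameters are sorted so the same arguments in any order
    produce the same key; the hash keeps the key short and
    free of whitespace/special characters.
    """
    raw = namespace + ":" + "&".join(
        f"{k}={params[k]}" for k in sorted(params)
    )
    return namespace + ":" + hashlib.sha1(raw.encode("utf-8")).hexdigest()

# Same parameters in any order -> same key, so cache hits are stable.
k1 = cache_key("user_profile", user_id=42, lang="en")
k2 = cache_key("user_profile", lang="en", user_id=42)
assert k1 == k2
```

The timeout is the other half of the question: data that is expensive to compute but tolerant of staleness (counters, listings) gets a long TTL; data that must be fresh (session state) gets a short one or explicit invalidation on write.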

Upvotes: 3
