John L.

Reputation: 1953

Scalability And Database Disk

In terms of scalability and responding to millions of requests per second, web servers, application servers, and database servers can be added to the cluster. But what I do not understand very well is this: isn't the disk where the database is stored the bottleneck here? Since there should be only one copy of the data being manipulated, the whole system will be limited by the I/O speed of that disk, right? Am I missing something? How is that limitation overcome? For example, how do Google, Facebook, etc. deal with it?

Upvotes: 1

Views: 52

Answers (3)

Since there should be only one copy of the data being manipulated, the whole system will be limited by the I/O speed of that disk, right?

No. First, it's not only the data that's manipulated. For example, an UPDATE statement touches the data itself, any relevant indexes, the transaction log, and so on. Second, most database management systems let you put database objects on different disks. So you could create the table on one disk, store its indexes on another disk, and write transaction log files to yet another disk.
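
As a concrete illustration of the "different disks for different objects" point, here is a minimal sketch assuming PostgreSQL and the psycopg2 driver. The connection string, paths, and table are hypothetical, and CREATE TABLESPACE needs superuser rights plus directories that already exist on the target disks.

```python
# Sketch only: spread a table, its index, and (indirectly) the WAL across
# different physical disks in PostgreSQL. Names and paths are made up.
import psycopg2

conn = psycopg2.connect("dbname=shop user=postgres")   # hypothetical DSN
conn.autocommit = True   # CREATE TABLESPACE cannot run inside a transaction block
cur = conn.cursor()

# One tablespace per physical disk (the directories must already exist).
cur.execute("CREATE TABLESPACE data_disk  LOCATION '/mnt/disk1/pgdata'")
cur.execute("CREATE TABLESPACE index_disk LOCATION '/mnt/disk2/pgdata'")

# Table on one disk, its index on another.
cur.execute("""
    CREATE TABLE orders (
        id    bigserial PRIMARY KEY,
        total numeric NOT NULL
    ) TABLESPACE data_disk
""")
cur.execute("CREATE INDEX orders_total_idx ON orders (total) TABLESPACE index_disk")

# The transaction log (WAL) can live on yet another disk, but that location is
# chosen when the cluster is created, e.g.:  initdb --waldir=/mnt/disk3/pg_wal
cur.close()
conn.close()
```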

These things are independent of the other techniques used to improve performance--caching, partitioning, sharding (which implies multiple database servers), etc.

Upvotes: 2

Neville Kuyt

Reputation: 29629

The number of solutions that need to scale to those levels ("millions of requests per second") is still quite small. They all have complex architectures to cope with their scalability requirements.

The most common strategies I'm aware of are:

Caching: cache everything, so you don't need to hit the database. Most high-scalability solutions cache entire (web) pages, page fragments, data entities/objects, and query results. This requires significant architectural thought, as a cache can by definition go stale, while flushing it incorrectly may create a worse scalability problem (see the cache-aside sketch below).

Partitioning: storing different data entities on different physical devices - disks, servers, or hosting locations. For instance, I believe that Twitter stores "new" tweets in different locations from "old" tweets. This means that when they retrieve someone's feed, they can run the queries against different servers once the feed moves past the "new" tweets (see the shard-routing sketch below).

Lazy loading: alongside partitioning, loading only the information you really need helps a great deal. This is why you see "spinning wheel" loading symbols as you scroll down your Facebook page.

Denormalization/pre-computing: by sacrificing a "single source of truth" and instead pre-computing information (e.g. the number of notifications in your Facebook feed), they avoid running complex queries at page-load time (see the pre-computed counter sketch below).

CQRS: by separating responsibilities for "managing" and "querying" data, they replace the traditional "single source of truth" architecture model with a message-based system, which scales much better.
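
To make the caching point concrete, here is a minimal cache-aside sketch in Python. The dict stands in for Redis/Memcached, and query_db/write_db are hypothetical stand-ins for real database calls; a TTL plus invalidate-on-write is one common answer to the staleness problem mentioned above.

```python
# Cache-aside sketch: an in-process dict standing in for Redis/Memcached.
# query_db()/write_db() are hypothetical stand-ins for real database calls.
import time

CACHE = {}            # key -> (expires_at, value)
TTL_SECONDS = 60      # bounds how long a stale entry can live

def query_db(user_id):
    return {"id": user_id, "name": f"user-{user_id}"}   # pretend DB read

def write_db(user_id, fields):
    pass                                                 # pretend DB write

def get_user(user_id):
    key = f"user:{user_id}"
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():                 # fresh hit: no disk I/O at all
        return hit[1]
    value = query_db(user_id)                        # miss or expired: one DB read
    CACHE[key] = (time.time() + TTL_SECONDS, value)
    return value

def update_user(user_id, fields):
    write_db(user_id, fields)                        # write goes to the database
    CACHE.pop(f"user:{user_id}", None)               # invalidate so the next read refills

print(get_user(42))   # miss: hits the "database"
print(get_user(42))   # hit: served from the cache, the database is not touched
```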
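
And a sketch of the partitioning/sharding idea: route each query to a server chosen from the shard key, so no single server or disk sees all the traffic. The host names and hashing scheme here are invented, not how Twitter actually does it.

```python
# Shard-routing sketch: derive the target database server from the shard key.
# Host names are hypothetical; real systems also handle resharding, replicas, etc.
import hashlib

SHARDS = ["db-shard-0.internal", "db-shard-1.internal", "db-shard-2.internal"]

def shard_for(user_id: int) -> str:
    # Stable hash so the same user always lands on the same server/disk.
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for uid in (1, 2, 3, 42):
    print(uid, "->", shard_for(uid))
```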
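
Finally, a toy illustration of pre-computing: keep a notification counter up to date on every write instead of counting rows at page-load time. The data structures are invented for the example.

```python
# Pre-computing sketch: maintain a counter on write instead of running a
# "count all unread notifications" query on every page load.
notifications = []          # the normalised "source of truth"
unread_count = {}           # denormalised, pre-computed per user

def add_notification(user_id, text):
    notifications.append({"user_id": user_id, "text": text, "read": False})
    unread_count[user_id] = unread_count.get(user_id, 0) + 1   # pay the cost on write

def render_header(user_id):
    # Page load is now a dictionary lookup, not a scan over all notifications.
    return f"You have {unread_count.get(user_id, 0)} unread notifications"

add_notification(7, "Alice liked your post")
add_notification(7, "Bob commented")
print(render_header(7))     # -> "You have 2 unread notifications"
```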

Upvotes: 2

nvogel

Reputation: 25526

Google, Facebook, etc. don't store data in just one place; they distribute and replicate it over many servers and disks. Database technologies like Hadoop and Azure SQL Database keep at least three copies of your data on disk - and that's in addition to any in-memory caching that might also take place.
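
A toy sketch of that replication idea (not Hadoop's or Azure's actual protocol): every write is fanned out to several replicas, so no single disk holds the only copy and reads can be served from any of them.

```python
# Toy replication sketch: three in-memory dicts stand in for three storage
# nodes/disks. Real systems add quorums, failure handling, consistency, etc.
REPLICAS = [{}, {}, {}]

def put(key, value):
    for store in REPLICAS:          # fan the write out to every replica
        store[key] = value

def get(key):
    for store in REPLICAS:          # any replica that has the key can answer
        if key in store:
            return store[key]
    return None

put("user:42", {"name": "Ada"})
print(get("user:42"))
```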

Upvotes: 2
