user2394303

Will I speed up a relational database by splitting a huge table into several small tables?

Assume there is one huge table containing one billion rows, and I split it, using a hash function that takes the primary key as its parameter, into 1000 tables of one million rows each. Will querying and updating get faster?
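For illustration, a minimal sketch of the same idea using MySQL's built-in hash partitioning (the table and column names here are hypothetical); it keeps one logical table instead of 1000 physical ones:

-- Hypothetical table; MySQL routes each row to one of 1000 partitions
-- by hashing the primary key, much like the manual split described above.
CREATE TABLE events (
  id      BIGINT NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id)
)
PARTITION BY HASH(id)
PARTITIONS 1000;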

Upvotes: 1

Views: 60

Answers (2)

Anatoly

Reputation: 15530

There is usually overhead to keep indexes updated on INSERT/UPDATE/DELETE, and the database engine should have enough memory to keep all indexes and data in buffers to avoid redundant I/O. It can be helpful to know the index and data size per table (MySQL):

-- pw (in derived table A) picks the unit: 1 = KB, 2 = MB, 3 = GB, ...
SET @db_name = 'your_database';

SELECT
  TBname,
  CONCAT(LPAD(REPLACE(FORMAT(B.DSize/POWER(1024,pw),3),',',''),17,' '),' ', SUBSTR(' KMGTP',pw+1,1),'B') "Data Size",
  CONCAT(LPAD(REPLACE(FORMAT(B.ISize/POWER(1024,pw),3),',',''),17,' '),' ', SUBSTR(' KMGTP',pw+1,1),'B') "Index Size",
  CONCAT(ROUND(B.ISize * 100 / B.DSize), ' %') "Percentage",
  CONCAT(LPAD(REPLACE(FORMAT(B.TSize/POWER(1024,pw),3),',',''),17,' '),' ', SUBSTR(' KMGTP',pw+1,1),'B') "Table Size"
FROM
  (SELECT table_name TBname, data_length DSize, index_length ISize, data_length+index_length TSize
     FROM information_schema.tables WHERE table_schema = @db_name) B,
  (SELECT 3 pw) A
ORDER BY ISize DESC, DSize DESC;

Wikipedia says:

An index is any data structure that improves the performance of lookup. There are many different data structures used for this purpose. There are complex design trade-offs involving lookup performance, index size, and index update performance. Many index designs exhibit logarithmic O(log(N)) lookup performance and in some applications it is possible to achieve flat O(1) performance.
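To put rough numbers on the logarithmic case: a binary-tree lookup over one billion rows costs on the order of log2(10^9) ≈ 30 comparisons, while one million rows cost log2(10^6) ≈ 20, so splitting the table 1000-fold saves only about a third of the per-lookup work for an indexed query. In practice a B-tree's fan-out is far higher than 2, so the difference in tree depth is even smaller.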

If each database table corresponds to its own file(s) on disk, be careful with these things:

  • Number of free inodes (df -i)
  • Number of open files (cat /proc/sys/fs/file-max)

In terms of O(1) algorithmic complexity, database size doesn't really matter, but unless your data and indexes fit in memory, the bottleneck is disk I/O (even on SSDs). On the other hand, the database configuration may require full ACID compliance, which ends up flushing to disk very frequently and degrades performance on a bigger database under load.
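As an illustration of that durability trade-off in MySQL/InnoDB (a sketch, not a recommendation, since the relaxed setting weakens crash durability):

-- Full ACID: fsync the redo log on every commit (the default, slowest under heavy writes)
SET GLOBAL innodb_flush_log_at_trx_commit = 1;

-- Relaxed: write the log at commit but fsync only about once per second;
-- up to ~1 second of committed transactions can be lost on a crash
SET GLOBAL innodb_flush_log_at_trx_commit = 2;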

Back to the original question. It makes sense to split a big table into a number of small tables to speed up index management, which performs better (and consumes less memory) on a small data set. If a good sharding key is difficult to find, you may consider an alternative naming convention with year and month as a table name suffix (posts -> posts_2015_06, posts_2015_07, posts_2015_08) or an archiving strategy (posts -> posts_archive, posts_fresh). It depends on the number of INSERT/UPDATE/DELETE requests that happen against historical data.
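A minimal sketch of the month-suffix convention, assuming a hypothetical posts table with a created_at column:

-- Create the monthly table with the same columns and indexes as the base table
CREATE TABLE posts_2015_07 LIKE posts;

-- Move July 2015 rows out of the hot table
INSERT INTO posts_2015_07
  SELECT * FROM posts
  WHERE created_at >= '2015-07-01' AND created_at < '2015-08-01';

DELETE FROM posts
  WHERE created_at >= '2015-07-01' AND created_at < '2015-08-01';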

Upvotes: 1

duffymo

Reputation: 308763

The answer is: it depends on the data, the partitioning, your queries, and especially the indexes.

Such a partition makes sense if you split by date. Historical data is routinely moved out of transactional stores into reporting or warehousing databases this way.

I'd wonder whether you have the indexes you need. You should have indexes on the columns that appear in WHERE clauses.

Run EXPLAIN (or EXPLAIN PLAN, depending on your database) on slow queries and look for table scans.
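For example, in MySQL (table and column names here are hypothetical):

-- 'type: ALL' in the EXPLAIN output indicates a full table scan
EXPLAIN SELECT * FROM posts WHERE author_id = 42;

-- An index on the filtered column usually turns the scan into a ref/range lookup
CREATE INDEX idx_posts_author_id ON posts (author_id);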

A billion rows is not extraordinary.

Upvotes: 0
