Reputation: 1519

MySQL and a table with 100+ millions of rows

I have a few tables with more than 100+ millions of rows. I get about 20-40 millions of rows each month.

At this moment everything seems fine: - all inserts are fast - all selects are fast ( they are using indexes and don't use complex aggregations )

However, I am worried about two things, what I've read somewhere: - When a table has few hundred millions of rows, there might be slow inserts, because it might take a while to re-balance the indexes ( binary trees ) - If index doesn't fit into memory, it might take a while to read it from the different parts of the disk.

Any comments would be highly appreciated. Any suggestions how can I avoid it or how can I fix/mitigate the problem if/when it happens would be highly appreciated.

( I know we should start doing a sharding at some day )

Thank you in advance.

Upvotes: 35

Answers (2)

Rick James

Reputation: 142528

I will attempt to address the points being made by the OP and the other responders. The Question only touches the surface; this Answer follows suit. We can dig deeper in more focused Questions.

A trillion rows gets dicey. 100M is not necessarily problematic.
PARTITIONing is not a performance panacea. The main case where it can be useful way is when you need to purge "old" data. (DROP PARTITION is a lot faster than DELETEing a zillion rows.)
INSERTs with an AUTO_INCREMENT PRIMARY KEY will 'never' slow down. This applies to any temporal key and/or small set of "hot spots". Example PRIMARY KEY(stock_id, date) is limited to as many hot spots as you have stocks.
INSERTs with a UUID PRIMARY KEY will get slower and slower. But this applies to any "random" key.
Secondary indexes suffer the same issues as the PK, however later. This is because it is dependent on the size of the BTree. (The data's BTree ordered by the PK is usually bigger than each secondary key.)
Whether an index (including the PK) "fits in memory" matters only if the inserts are 'random' (as with a UUID).
For Data Warehouse applications, it is usually advisable to provide Summary Tables instead of extra indexes on the 'Fact' table. This yields "report" queries that may be as much as 10 times as fast.
Blindly using AUTO_INCREMENT may be less than optimal.
The BTree for the data or index of a million-row table will be about 3 levels deep. For a trillion rows, 6 levels. This "number of levels" has some impact on performance.
Binary trees are not used; instead BTrees (actually B+Trees) are used by InnoDB.
InnoDB mostly keeps its BTrees balanced without much effort. Don't worry about it. (And don't use OPTIMIZE TABLE.)
All activity is done on 16KB blocks (of data or index) and done in RAM (in the buffer_pool). Neither a table nor an index is "loaded into RAM", at least not explicitly as a whole unit.
Replication is useful for read scaling. (And readily available in MySQL.)
Sharding is useful for write scaling. (This is a DYI task.)
As a Rule of Thumb, keep half of your disk free for various admin purposes on huge tables.
Before a table gets into the multi-GB size range, it is wise to re-think the datatypes and normalization.
The main tunable in InnoDB (these days) is innodb_buffer_pool_size, which should (for starters) be about 70% of available RAM.
Row_format=compressed is often not worth using.
YouTube, Facebook, Google, etc, are 'on beyond' anything discussed in this Q&A. They use thousands of servers, custom software, etc.

If you want to discuss your specific application, let's see some details. Different apps need different techniques.

My blogs, which provide more details on many of the above topics: https://mysql.rjweb.org

Upvotes: 11

tadman

Reputation: 211730

Today is the day you should think about sharding or partitioning because if you have 100MM rows today and you're gaining them at ~30MM per month then you're going to double the size of that in three months, and possibly double it again before the year is out.

At some point you'll hit an event horizon where your database is too big to migrate. Either you don't have enough working space left on your disk to switch to an alternate schema, or you don't have enough down-time to perform the migration before it needs to be operational again. Then you're stuck with it forever as it gets slower and slower.

The performance of write activity on a table is largely a function of how difficult the indices are to maintain. The more data you index the more punishing writes can be. The type of index is all relevant, some are more compact than others. If your data is lightly indexed you can usually get away with having more records before things start to get cripplingly slow, but that degradation factor is highly dependent on your system configuration, your hardware, and your IO capacity.

Remember, InnoDB, the engine you should be using, has a lot of tuning parameters and many people leave it set to the really terrible defaults. Have a look at the memory allocated to it and be sure you're doing that properly.

If you have any way of partitioning this data, like by month, by customer, or some other factor that is not going to change based on business logic, that is the data is intrinsically not related, you will have many simple options. If it's not, you'll have to make some hard decisions.

The one thing you want to be doing now is simulating what your table's performance is like with 1G rows in it. Create a sufficiently large, suitably varied amount of test data, then see how well it performs under load. You may find it's not an issue, in which case, no worries for another few years. If not, start panicking today and working towards a solution before your data becomes too big to split.

Database performance generally degrades in a fairly linear fashion, and then at some point it falls off a cliff. You need to know where this cliff is so you know how much time you have before you hit it. The sharp degradation in performance usually comes when your indexes can't fit in memory and when your disk buffers are stretched too thin to be useful.

Upvotes: 43

MySQL and a table with 100+ millions of rows

Answers (2)

Related Questions