Reputation: 21
Suppose there are messages (mostly incoming) that get sent/received quite fast: 80–100 per second. Those get written to the table. We have around 300 million of them by now, and that pushes us to change how the data is organized. Initially the table structure was quite simple and naive: `ID` as the PRIMARY KEY and two separate indexes on `Arrival` and `SenderID`.
According to different sources, the following structure seems to be appropriate:
CREATE TABLE `messages` (
`ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`Arrival` TIMESTAMP NOT NULL,
`SenderID` INT UNSIGNED NOT NULL,
`Incoming` TINYINT(1) NOT NULL,
-- Other fields that describe the message skipped
PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
INDEX `ID` (`ID`) USING BTREE,
INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)
COLLATE='utf8mb4_unicode_ci'
PARTITION BY RANGE (unix_timestamp(`Arrival`))
(PARTITION past VALUES LESS THAN (1717189200) ENGINE = InnoDB,
PARTITION p202406 VALUES LESS THAN (1719781200) ENGINE = InnoDB,
-- Partitions in between skipped
PARTITION p202512 VALUES LESS THAN (1767214800) ENGINE = InnoDB,
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB);
Queries are mostly for messages of a particular `SenderID` or for all senders, in both cases for a particular time interval. `ID` is used to ensure uniqueness; its index is both required by the AUTO_INCREMENT feature and expected to simplify data copying if we ever need it. Plus, a few other tables use the IDs to reference the records (no CONSTRAINTs involved, just plain values).
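For illustration, the typical query shapes look roughly like this (the sender and dates are placeholders; real queries select specific columns):

```sql
-- Messages of one sender within a time interval
SELECT * FROM messages
WHERE SenderID = 12345
  AND Arrival >= '2025-01-01' AND Arrival < '2025-02-01';

-- Messages of all senders within a time interval
SELECT * FROM messages
WHERE Arrival >= '2025-01-01' AND Arrival < '2025-02-01';
```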
This table is the largest one; the rest are pretty small and fast.
We run experiments on a server that is no longer in use but still has the 300-million-row table. Several copies of the original database (minus the `messages` table data) were created, and we benchmark different table structures by copying the data from the original database and then measuring query times with real-world queries.
Normally copying those 300 million records took 1–2 days in total. But somehow this particular structure makes the copying enormously slow. It is done (both for previous experiments and for this one) with queries like these:
INSERT INTO `testdb`.`messages` SELECT * FROM `originaldb`.`messages` WHERE `ID` BETWEEN 1 AND 10000000;
INSERT INTO `testdb`.`messages` SELECT * FROM `originaldb`.`messages` WHERE `ID` BETWEEN 10000001 AND 20000000;
...
The copy is split into chunks so we can see the progress and avoid MySQL exceeding the limits of some buffers (which happened when we tried to copy the data in a single such query).
It took around 29 hours for the first chunk. We found bug reports mentioning that innodb_adaptive_hash_index might cause bad performance like that, so copying was stopped, the feature disabled, and the whole experiment restarted from the beginning.
The first 2 million records took around 10 minutes, but then everything slowed down: within the first hour (so far) it hasn't even finished the 3rd million.
The test server is not used by anything else and has more than 1.5 TB of free space left (the database itself takes only around 25–30 GB). It's MySQL 8.0.20.
Any ideas and suggestions are appreciated.
UPD. Not sure if this helps, but here's SHOW ENGINE INNODB STATUS after about 8 million rows inserted:
=====================================
2025-02-27 11:51:09 0x1a84 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 44 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 84420 srv_active, 0 srv_shutdown, 2 srv_idle
srv_master_thread log flush and writes: 0
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 4086362
--Thread 2348 has waited at buf0flu.cc line 1635 for 0 seconds the semaphore:
Mutex at 00000215AD29E598, Mutex BUF_POOL_LRU_LIST created buf0buf.cc:1227, lock var 1
--Thread 1252 has waited at buf0buf.cc line 4643 for 0 seconds the semaphore:
Mutex at 00000215AD29E598, Mutex BUF_POOL_LRU_LIST created buf0buf.cc:1227, lock var 1
OS WAIT ARRAY INFO: signal count 1207206
RW-shared spins 24862, rounds 24866, OS waits 4
RW-excl spins 2785, rounds 77298, OS waits 2561
RW-sx spins 310, rounds 9240, OS waits 307
Spin rounds per wait: 1.00 RW-shared, 27.76 RW-excl, 29.81 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 329387481
Purge done for trx's n:o < 329387481 undo n:o < 0 state: running but idle
History list length 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 283767129662384, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283767129660736, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283767129659912, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 329387336, ACTIVE 85603 sec inserting, thread declared inside InnoDB 3111
mysql tables in use 2, locked 2
25755 lock struct(s), heap size 3743952, 8739195 row lock(s), undo log entries 8713444
MySQL thread id 8, OS thread handle 3208, query id 517 localhost 127.0.0.1 root executing
INSERT INTO `testdb`.`messages` SELECT * FROM `originaldb`.`messages` WHERE `ID` BETWEEN 1 AND 10000000
---TRANSACTION 329380910, ACTIVE 86147 sec recovered trx
ROLLING BACK 1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 194935309
--------
FILE I/O
--------
I/O thread 0 state: wait Windows aio (insert buffer thread)
I/O thread 1 state: wait Windows aio (log thread)
I/O thread 2 state: wait Windows aio (read thread)
I/O thread 3 state: wait Windows aio (read thread)
I/O thread 4 state: wait Windows aio (read thread)
I/O thread 5 state: wait Windows aio (read thread)
I/O thread 6 state: wait Windows aio (write thread)
I/O thread 7 state: wait Windows aio (write thread)
I/O thread 8 state: complete io for buf page (write thread)
I/O thread 9 state: complete io for buf page (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 1, 1] ,
ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 1; buffer pool: 76
5625836 OS file reads, 13062964 OS file writes, 4442547 OS fsyncs
62.77 reads/s, 16384 avg bytes/read, 150.25 writes/s, 50.25 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 1908, seg size 1910, 0 merges
merged operations:
insert 0, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
0.00 hash searches/s, 351.72 non-hash searches/s
---
LOG
---
Log sequence number 518180430405
Log buffer assigned up to 518180430405
Log buffer completed up to 518180430405
Log written up to 518180430405
Log flushed up to 518180429297
Added dirty pages up to 518180430405
Pages flushed up to 518164401302
Last checkpoint at 518164401302
5349965 log i/o's done, 60.36 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137428992
Dictionary memory allocated 916291
Buffer pool size 8192
Free buffers 2
Database pages 7961
Old database pages 2919
Modified db pages 7038
Pending reads 0
Pending writes: LRU 6, flush list 0, single page 0
Pages made young 1867654, not young 12082459
0.00 youngs/s, 0.00 non-youngs/s
Pages read 5625439, created 91520, written 5881777
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 959 / 1000, young-making rate 15 / 1000 not 72 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 7961, unzip_LRU len: 0
I/O sum[5980]:cur[55], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Process ID=704, Main thread ID=0000000000000EA8 , state=sleeping
Number of rows inserted 8804050, updated 0, deleted 0, read 8713446
75.04 inserts/s, 0.00 updates/s, 0.00 deletes/s, 75.04 reads/s
Number of system rows inserted 1039, updated 1484, deleted 983, read 477772
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
UPD2. The original naively indexed table was this:
CREATE TABLE IF NOT EXISTS `Messages` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`Arrival` timestamp NOT NULL DEFAULT current_timestamp(),
`SenderID` int(10) unsigned NOT NULL,
-- Other fields here
PRIMARY KEY (`ID`),
KEY `SenderID` (`SenderID`),
KEY `Arrival` (`Arrival`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Copying performance was acceptable both for a target table with the same structure and for one with the same fields but PRIMARY KEY(Arrival, SenderID, ID), KEY(ID), which took 12h40m for 300 million rows. But the partitioned target with supposedly optimal indexes is slow.
The number of distinct SenderIDs is around 12,000. The amount of RAM is 16 GB, and innodb_buffer_pool_size is 128 MB, which looks too little (I don't know why this value was chosen originally). But, if that matters, it's still unclear why copying the same data differs so much in speed between structures.
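As a back-of-envelope check of "looks too little" (assuming InnoDB's default 16 KiB page size on this server): if inserts keep one "current" index page warm per sender, the 12,000 senders alone need more hot pages than the 128 MB pool can hold:

```python
# Rough arithmetic only; 16 KiB is InnoDB's default page size
# (an assumption about this server's configuration).
senders = 12_000
page_size = 16 * 1024                # bytes per InnoDB page
buffer_pool = 128 * 1024 * 1024      # innodb_buffer_pool_size = 128M

hot_bytes = senders * page_size      # one hot index page per sender
print(hot_bytes / 2**20)             # 187.5 (MiB)
print(hot_bytes > buffer_pool)       # True: hot set alone exceeds the pool
```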
Not sure if this could affect performance, but there were a few forced terminations of MySQL: (1) due to lack of disk space, (2) due to a lock buffer overflow, and (3) to stop the unacceptably slow copying. But the first two happened before any of the copy attempts, including those that showed good performance.
Upvotes: 1
Views: 81
Reputation: 142483
Excellent design for your PARTITIONing. (My blog on Partitioning)
Possible criticism: Are those partitions "daily", and do you now have about 370? I would recommend switching to weekly. This would make queries that cannot use pruning run faster.
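If you do switch granularity, one way is to carve new partitions out of `future` with REORGANIZE PARTITION (boundary dates here are purely illustrative, and note that UNIX_TIMESTAMP depends on the server time zone; reorganizing the empty `future` partition is cheap, while reorganizing populated partitions copies their rows):

```sql
ALTER TABLE messages
    REORGANIZE PARTITION future INTO (
        PARTITION p2026w01 VALUES LESS THAN (UNIX_TIMESTAMP('2026-01-05')),
        PARTITION p2026w02 VALUES LESS THAN (UNIX_TIMESTAMP('2026-01-12')),
        PARTITION future   VALUES LESS THAN MAXVALUE
    );
```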
The fastest way to copy a non-partitioned table is via its PRIMARY KEY. If one sender_id does not have too many rows, simply copy one sender at a time. More discussion: see my blog on Chunking; though it does not address your specific case, it does discuss composite PRIMARY KEY issues and shows a simple way to chunk (ORDER BY ... LIMIT 1000,1) to get the next stopping point.
Since your table has good partitioning, let's discuss several options. [I don't know which will be best.] Chunk the data one of these ways:

- By sender_id, as mentioned above. That uses "locality of reference" but requires looking at all partitions.
- By partition. That uses partition pruning. This can be done either by specifying the partition name or by saying WHERE Arrival >= ... AND Arrival < ...
- Somehow use "transportable tablespaces" -- perhaps detach a partition, making it a "table"; copy that "table"; then reattach the "table" back to the original partitioned table. [I have never tried this. I don't know how much downtime would be needed.]
- LVM -- this plays games with the filesystem. [LVM may not be available on your OS.] Pro: copying the entire table takes seconds or minutes. Con: LVM has to be established at the OS and filesystem level beforehand.
For any chunking: whether there are 30 or 30,000 "copy-and-paste queries" doesn't matter much. I recommend 1000 rows at a time; that will run at about 99% of the theoretical maximum speed. (And it will avoid actually slowing down because of logs, etc.)
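A sketch of that LIMIT-based chunking for the per-sender approach (the @-variables are placeholders carried over from the previous chunk; assumes the PK (SenderID, Arrival, ID) from the question, so within one sender the PK order is (Arrival, ID)):

```sql
-- Find the exclusive stopping point of the next 1000-row chunk for one sender
SELECT Arrival, ID INTO @next_arrival, @next_id
FROM originaldb.messages
WHERE SenderID = @sender
  AND (Arrival, ID) > (@last_arrival, @last_id)
ORDER BY Arrival, ID
LIMIT 1000, 1;

-- Copy everything between the previous stopping point and the new one
INSERT INTO testdb.messages
SELECT * FROM originaldb.messages
WHERE SenderID = @sender
  AND (Arrival, ID) > (@last_arrival, @last_id)
  AND (Arrival, ID) < (@next_arrival, @next_id);
```

(When the SELECT returns no row, copy the remaining tail for that sender without the upper bound.)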
Why 128M for the buffer pool is too small

Your new PRIMARY KEY starts with SenderID; the partitioning neither helps nor hurts here. Further, note that all 3 indexes (including the clustered PK) need updating whenever a row is inserted. (A portion of the buffer pool is also reserved for delayed updating of non-unique secondary keys.)

A large insert will be performed in the order that the rows arrive. Let's analyze INSERT ... SELECT * ... WHERE ID BETWEEN ...:

- INDEX(ID) (and probably INDEX(Arrival, ...)) will each need only 16KB (1 block) at a time for the insert, since the copied rows arrive roughly in that order.
- The PK (SenderID, ...) will be jumping around among the senders. The math (12K senders * 16KB) says that a lot of the buffer pool will be needed -- essentially 192MB before having to bump a block out of the buffer pool. Thrashing.
- The PARTITIONing combined with PK(SenderID, ...) does not change the math much.

Another thought
Three ways to fill a table:
1. As you appear to do it -- CREATE TABLE with all the indexes in place. This leads to the thrashing I detailed above.
2. Have no indexes on the partitioned table while doing the INSERTs. Then do ALTER TABLE ADD PRIMARY KEY(...), INDEX(...), INDEX(...); This will (I think) speed up the index generation by doing the sorting on disk.
3. I prefer to have the PK already in the table while doing the inserts, then build the secondary indexes in an ALTER. But I do not know whether that is better or worse than delaying the PK.
If you have the time and space, this is simpler and better:
ALTER TABLE DROP ((all indexes, including PK))
ALTER TABLE .. ADD/REMOVE PARTITIONING ...
ALTER TABLE ADD ((the better set of indexes for the new layout))
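Sketched concretely for this table (abbreviated, and only a sketch: it assumes the original layout from UPD2 above, and note that an AUTO_INCREMENT column must keep some index at all times, so the first ALTER gives ID a plain index in the same statement that drops the PK):

```sql
ALTER TABLE Messages
    DROP PRIMARY KEY, ADD INDEX (ID),
    DROP INDEX SenderID,
    DROP INDEX Arrival;

ALTER TABLE Messages
    PARTITION BY RANGE (UNIX_TIMESTAMP(Arrival)) (
        PARTITION past   VALUES LESS THAN (1717189200),
        -- monthly partitions skipped
        PARTITION future VALUES LESS THAN MAXVALUE
    );

ALTER TABLE Messages
    ADD PRIMARY KEY (SenderID, Arrival, ID),
    ADD INDEX (Arrival, SenderID);
```

Each ALTER rebuilds this 300-million-row table, so expect each step to take a while.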
Upvotes: 1