Reputation: 298
I have one table consisting of 100+ million rows and want to copy the data into another table. I have one requirement: the query must not block other operations on these tables. I have written a stored procedure as follows.
I count the number of rows in the source table, then loop, copying 10,000 rows per iteration inside its own transaction (start transaction, insert, commit), then read the next 10,000 rows by offset.
DELIMITER $$
CREATE PROCEDURE insert_data()
BEGIN
    DECLARE i INT DEFAULT 0;
    DECLARE iterations INT DEFAULT 0;
    DECLARE rowOffset INT DEFAULT 0;
    DECLARE limitSize INT DEFAULT 10000;

    -- number of 10,000-row chunks to copy
    SET iterations = (SELECT COUNT(*) FROM Table1) / 10000;

    WHILE i <= iterations DO
        START TRANSACTION;
        INSERT IGNORE INTO Table2(id, field2, field3)
        SELECT f1, f2, f3
        FROM Table1
        ORDER BY id ASC
        LIMIT limitSize OFFSET rowOffset;
        COMMIT;
        SET i = i + 1;
        SET rowOffset = rowOffset + limitSize;
    END WHILE;
END$$
DELIMITER ;
The procedure executes without locking the tables, but after copying a few million rows it has become too slow. Please suggest a better way to do this. Thank you!
Upvotes: 4
Views: 6867
Reputation: 51
For my setup and needs: I had to copy 300 to 500 million rows, promptly, on a massively under-spec server.
Dump to CSV, split the result into multiple files (in my case 200k rows per file was optimal), then import the split files:
-- SourceTableName is a stand-in; the original does not name the table being exported
SELECT a, b, c FROM SourceTableName INTO OUTFILE '/path/dumpfile.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
split -a 6 -l 200000 dumpfile.csv FileNamePrefix.  # trailing "." lets the numeric suffix read like a file extension
for f in ./*; do
    mysql -uUser -pPassword dbname -e "set autocommit = 0; set unique_checks = 0; set foreign_key_checks = 0; set sql_log_bin = 0; LOAD DATA CONCURRENT INFILE '/path/to/split/files/$f' IGNORE INTO TABLE InputTableName FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (a, b, c); commit;"
    echo "Done: '$f' at $(date)"
done
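After all files are loaded, a quick row-count comparison between the dump and the destination table is a cheap sanity check (a sketch, reusing the names from the commands above):
# lines in the dump versus rows that made it into the table
wc -l dumpfile.csv
mysql -uUser -pPassword dbname -e "SELECT COUNT(*) FROM InputTableName;"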
Upvotes: 1
Reputation: 298
Thanks @Bill Karwin, I removed the offset as you suggested. The following query worked amazingly well:
DROP PROCEDURE IF EXISTS insert_data;
DELIMITER $$
CREATE PROCEDURE insert_data()
BEGIN
    DECLARE i INT DEFAULT 0;
    DECLARE limitSize INT DEFAULT 2000;
    DECLARE maxId INT DEFAULT 0;

    SET maxId = (SELECT MAX(id) FROM Table1);

    WHILE i <= maxId DO
        START TRANSACTION;
        INSERT IGNORE INTO Table2(id, field1, field2)
        SELECT id, field3, field4
        FROM Table1
        WHERE id > i
        ORDER BY id ASC
        LIMIT limitSize;
        COMMIT;
        SET i = i + limitSize;
    END WHILE;
END$$
DELIMITER ;
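A side note on this approach (not part of the original answer): because i advances by limitSize rather than by the ids actually copied, gaps in the id sequence make consecutive chunks overlap, so some rows are read and re-inserted (harmlessly, thanks to INSERT IGNORE) more than once. A hedged variant that follows the highest id copied so far instead, assuming Table2 starts empty and is populated only by this procedure:
DELIMITER $$
CREATE PROCEDURE insert_data_keyset()
BEGIN
    DECLARE lastId INT DEFAULT 0;
    DECLARE maxId INT DEFAULT 0;
    DECLARE limitSize INT DEFAULT 2000;

    SET maxId = (SELECT MAX(id) FROM Table1);

    WHILE lastId < maxId DO
        START TRANSACTION;
        INSERT IGNORE INTO Table2(id, field1, field2)
        SELECT id, field3, field4
        FROM Table1
        WHERE id > lastId
        ORDER BY id ASC
        LIMIT limitSize;
        COMMIT;
        -- advance to the highest id copied so far;
        -- assumes Table2 receives rows only from this copy
        SET lastId = (SELECT MAX(id) FROM Table2);
    END WHILE;
END$$
DELIMITER ;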
Upvotes: 2
Reputation: 562398
Any INSERT ... SELECT ... query acquires a SHARED lock on the rows it reads from the source table in the SELECT. But by processing smaller chunks of rows, the lock doesn't last too long.
The query with LIMIT ... OFFSET is going to be slower and slower as you advance through the source table. At 10,000 rows per chunk, you need to run that query 10,000 times, and each one has to start over and scan through the table to reach the new OFFSET.
No matter what you do, copying 100 million rows is going to take a while. It's doing a lot of work.
I would use pt-archiver, a free tool designed for this purpose. It processes the rows in "chunks" (or subsets). It will dynamically adjust the size of the chunks so that each chunk takes 0.5 seconds.
The biggest difference between your method and pt-archiver is that pt-archiver doesn't use LIMIT ... OFFSET; it walks along the primary key index, selecting chunks of rows by value instead of by position. So every chunk is read more efficiently.
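For reference, a command-line sketch of what that might look like (hedged: the host, credentials, table names, and chunk sizes here are illustrative, and note that pt-archiver deletes source rows by default, hence --no-delete for a pure copy):
pt-archiver \
  --source h=localhost,D=mydb,t=Table1 \
  --dest   h=localhost,D=mydb,t=Table2 \
  --where "1=1" \
  --no-delete \
  --limit 1000 --commit-each \
  --progress 100000 --statistics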
Re your comment:
I expect that making the batch size smaller — and increasing the number of iterations — will make the performance problem worse, not better.
The reason is that when you use LIMIT with OFFSET, every query has to start over at the start of the table and count the rows up to the OFFSET value. This gets longer and longer as you iterate through the table.
Running 20,000 expensive queries using OFFSET will take longer than running 10,000 similar queries. The most expensive part will not be reading 5,000 or 10,000 rows, or inserting them into the destination table. The expensive part will be skipping through ~50,000,000 rows, over and over again.
Instead, you should iterate over the table by values not by offsets.
INSERT IGNORE INTO Table2(id, field2, field3)
SELECT f1, f2, f3
FROM Table1
WHERE id BETWEEN rowOffset AND rowOffset+limitSize;
Before the loop, query MIN(id) and MAX(id), start rowOffset at the min value, and loop up to the max value.
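Put together as a loop, that might look something like this (a sketch reusing the column names from the question; the boundary handling is illustrative, not from the answer):
DELIMITER $$
CREATE PROCEDURE copy_by_id_range()
BEGIN
    DECLARE rowOffset INT DEFAULT 0;
    DECLARE maxId INT DEFAULT 0;
    DECLARE limitSize INT DEFAULT 10000;

    -- start at the smallest id and walk fixed-width id ranges up to the largest
    SELECT MIN(id), MAX(id) INTO rowOffset, maxId FROM Table1;

    WHILE rowOffset <= maxId DO
        START TRANSACTION;
        INSERT IGNORE INTO Table2(id, field2, field3)
        SELECT f1, f2, f3
        FROM Table1
        -- limitSize - 1 keeps consecutive ranges from sharing a boundary id
        WHERE id BETWEEN rowOffset AND rowOffset + limitSize - 1;
        COMMIT;
        SET rowOffset = rowOffset + limitSize;
    END WHILE;
END$$
DELIMITER ;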
This is the way pt-archiver works.
Upvotes: 7
Reputation: 823
Block is the operative word. Hopefully you are using InnoDB (which blocks at the record level) and not MyISAM (which blocks at the table level). Not knowing the complexity of the data or the hardware underneath, 10K records per loop may be too big.
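To check which engine the tables actually use (a quick sketch):
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('Table1', 'Table2');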
Upvotes: 0