jbdotdev

Reputation: 173

How to improve INSERT performance on a very large MySQL table

I am working on a large MySQL database and I need to improve INSERT performance on a specific table. The table contains about 200 million rows and its structure is as follows:

(a little premise: I am not a database expert, so the code I've written could be based on wrong foundations. Please help me to understand my mistakes :) )

CREATE TABLE IF NOT EXISTS items (
    id INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(200) NOT NULL,
    `key` VARCHAR(10) NOT NULL,
    busy TINYINT(1) NOT NULL DEFAULT 1,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,

    PRIMARY KEY (id, name),
    UNIQUE KEY name_key_unique_key (name, `key`),
    INDEX name_index (name)
) ENGINE=MyISAM
PARTITION BY LINEAR KEY(name)
PARTITIONS 25;

Every day I receive many CSV files in which each line consists of the pair "name;key", so I have to parse these files (adding the created_at and updated_at values for each row) and insert the values into my table. In this table, the combination of "name" and "key" MUST be UNIQUE, so I implemented the insert procedure as follows:

CREATE TEMPORARY TABLE temp_items (
    id INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(200) NOT NULL, 
    `key` VARCHAR(10) NOT NULL,
    busy TINYINT(1) NOT NULL DEFAULT 1,  
    created_at DATETIME NOT NULL, 
    updated_at DATETIME NOT NULL,  
    PRIMARY KEY (id) 
    ) 
ENGINE=MyISAM;

LOAD DATA LOCAL INFILE 'file_to_process.csv' 
INTO TABLE temp_items
FIELDS TERMINATED BY ',' 
OPTIONALLY ENCLOSED BY '\"' 
(name, `key`, created_at, updated_at);

INSERT INTO items (name, `key`, busy, created_at, updated_at)
(
    SELECT temp_items.name, temp_items.`key`, temp_items.busy, temp_items.created_at, temp_items.updated_at
    FROM temp_items
) 
ON DUPLICATE KEY UPDATE busy=1, updated_at=NOW();

DROP TEMPORARY TABLE temp_items;

The code just shown lets me reach my goal, but it takes about 48 hours to complete, and this is a problem. I think this poor performance is caused by the fact that, for each insertion, the script must check against a very large table (200 million rows) whether the pair "name;key" is unique.

How can I improve the performance of my script?

Thanks to all in advance.

Upvotes: 13

Views: 33534

Answers (5)

Criminal_Affair_At_SO

Reputation: 3473

After spending a huge amount of time and going through many documents, I am coming back here to this very first question to post everything that I found - it made a huge difference to me.

  • Check for obvious problems with the schema - I won't go into details as this has been covered elsewhere

  • Use InnoDB and not MyISAM

  • Do these before inserting and revert them afterwards:

    ALTER TABLE bigone DISABLE KEYS;

    SET GLOBAL foreign_key_checks = 0;

    SET GLOBAL unique_checks = 0;

    SET GLOBAL autocommit = 0;

    SET GLOBAL innodb_flush_log_at_trx_commit = 0;

    SET GLOBAL innodb_flush_log_at_timeout = 10;

    SET GLOBAL innodb_buffer_pool_size = 80% of RAM

    SET GLOBAL innodb_log_file_size = 1024 * 1024 * 1024

    SET GLOBAL innodb_write_io_threads = 16

  • Always insert data in primary key order; if using a composite primary key, insert sorted by the first column, then the second, and so on.

  • Use the multi-row INSERT ... VALUES (...), (...) form, which allows you to insert multiple rows at once (see the sketch after this list)

  • Wrap the INSERTs in START TRANSACTION ... COMMIT blocks; each transaction should, ideally, insert at least 1000 or even 10k rows

  • Consider using the even faster form LOAD DATA INFILE if you can

  • If an uncompressed table's clustered (primary) index does not fit in innodb_buffer_pool_size but a compressed one does, the compressed table will be faster - a very good sign of this problem is that insertions generate read as well as write ops instead of only write ops

  • If the clustered (primary) index of the table fits in innodb_buffer_pool_size, observe the output of SHOW ENGINE INNODB STATUS - in particular the amount of dirty pages in the buffer. When writing, MariaDB will tend to delay the writes - which is good, especially if you are using multiple threads - since it will write the data in large chunks, slightly reordering it in primary key order if needed. The downside, however, is that these dirty pages might not leave enough space for the primary index. SET GLOBAL innodb_max_dirty_pages_pct = <xx> so that the primary index always fits.
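
A minimal sketch of the multi-row INSERT and explicit-transaction points above, applied to the question's items table (the names and batch size are purely illustrative):

START TRANSACTION;

-- one statement inserts many rows at once, listed in key order
INSERT INTO items (name, `key`, busy, created_at, updated_at) VALUES
    ('aaa', 'k01', 1, NOW(), NOW()),
    ('aab', 'k02', 1, NOW(), NOW()),
    ('aac', 'k03', 1, NOW(), NOW());

-- ... more multi-row INSERTs until the batch holds a few thousand rows ...

COMMIT;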

With all of these I was able to bring down the loading of a 600M-row table from more than a week to about 24h.

And remember to revert all settings to their initial values, as some are very dangerous in a production environment!!!
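
For reference, reverting the most dangerous settings looks roughly like this, assuming the server was running on stock defaults before the load:

ALTER TABLE bigone ENABLE KEYS;

SET GLOBAL foreign_key_checks = 1;
SET GLOBAL unique_checks = 1;
SET GLOBAL autocommit = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL innodb_flush_log_at_timeout = 1;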

Upvotes: 2

Kemal Atik

Reputation: 337

Thinking in Java:

  • Divide the object list into partitions and generate a batch insert statement for each partition.
  • Utilize CPU cores and available DB connections efficiently; newer Java features make parallelism easy (e.g. parallel streams, fork/join), or you can create a custom thread pool sized to the number of CPU cores you have and feed your threads from a centralized blocking queue in order to invoke the batch insert prepared statements.
  • Decrease the number of indexes on the target table if possible. If a foreign key is not really needed, just drop it. Fewer indexes mean faster inserts.
  • Avoid using Hibernate except for CRUD operations; always write SQL for complex selects.
  • Decrease the number of joins in your queries; instead of forcing the DB to do the work, use Java streams for filtering, aggregation and transformation.
  • If you do not have to, do not combine selects and inserts into one SQL statement.
  • Add rewriteBatchedStatements=true to your JDBC connection string; it helps decrease the TCP-level communication between the app and the DB (see the sketch after this list).
  • Use @Transactional for the methods that carry out the insert batch and write the rollback methods yourself.
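
For illustration, with rewriteBatchedStatements=true the MySQL Connector/J driver collapses a batch of single-row prepared statements into one multi-row statement, roughly as sketched below against the question's items table (the values are made up; the rewrite is done by the driver, not by hand):

-- One row per addBatch() call on the prepared statement:
INSERT INTO items (name, `key`, busy, created_at, updated_at)
VALUES ('aaa', 'k01', 1, NOW(), NOW());

-- What the driver actually sends for the whole batch:
INSERT INTO items (name, `key`, busy, created_at, updated_at)
VALUES ('aaa', 'k01', 1, NOW(), NOW()),
       ('aab', 'k02', 1, NOW(), NOW()),
       ('aac', 'k03', 1, NOW(), NOW());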

Upvotes: 0

JAR

Reputation: 413

You can use the following methods to speed up inserts:

  1. If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time (sketched below). This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster.

  2. When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements.

  3. Take advantage of the fact that columns have default values. Insert values explicitly only when the value to be inserted differs from the default. This reduces the parsing that MySQL must do and improves the insert speed.

Reference: MySQL.com: 8.2.4.1 Optimizing INSERT Statements
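
A small sketch of points 1 and 3 applied to the question's items table (the values are illustrative): busy is omitted so its default of 1 is used, and the ON DUPLICATE KEY UPDATE clause mirrors the uniqueness requirement from the question:

INSERT INTO items (name, `key`, created_at, updated_at) VALUES
    ('item-a', 'k1', NOW(), NOW()),
    ('item-b', 'k2', NOW(), NOW()),
    ('item-c', 'k3', NOW(), NOW())
ON DUPLICATE KEY UPDATE busy = 1, updated_at = NOW();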

Upvotes: 9

Ronald

Reputation: 1

You could use

load data local infile ''
REPLACE
into table 

etc...

The REPLACE ensures that any duplicate value is overwritten with the new values. Add a SET updated_at=now() at the end and you're done.

There is no need for the temporary table.
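
A possible complete form of this approach; the file name and field terminator are taken from the question rather than from this answer, and created_at is also set because the table declares it NOT NULL without a default:

LOAD DATA LOCAL INFILE 'file_to_process.csv'
REPLACE
INTO TABLE items
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(name, `key`)
SET busy = 1,
    created_at = NOW(),
    updated_at = NOW();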

Upvotes: -2

Raymond Nijland

Reputation: 11602

Your linear key on name and the large indexes slow things down.

The LINEAR KEY needs to be calculated on every insert. http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html

Can you show us some example data from file_to_process.csv? Maybe a better schema should be built.

Edit: looked more closely

INSERT INTO items (name, `key`, busy, created_at, updated_at)
(
    SELECT temp_items.name, temp_items.`key`, temp_items.busy, temp_items.created_at, temp_items.updated_at
    FROM temp_items
)

This will probably create an on-disk temporary table, which is very, very slow, so you should not use it if you want more performance. Or maybe you should check some MySQL config settings like tmp_table_size and max_heap_table_size; maybe these are misconfigured (see the sketch below).
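
A quick way to check those limits and, if needed, raise them for the loading session (256 MB is just an illustrative value):

SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- In-memory temporary tables are capped by the smaller of the two,
-- so they are usually raised together:
SET SESSION tmp_table_size      = 256 * 1024 * 1024;
SET SESSION max_heap_table_size = 256 * 1024 * 1024;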

Upvotes: 5
