user3916429

Reputation: 572

MySQL Optimization for LOAD DATA INFILE

I see programmers everywhere discussing optimization for the fastest LOAD DATA INFILE inserts, but they never explain their choice of values much, and optimization depends on the environment and on the actual needs.

So I would like some explanation of the best values to put in my MySQL config file to reach the fastest insert possible, please.

My setup: an Intel dual-core @ 3.30 GHz, 4 GB DDR4 RAM (Windows 7 says "2.16 GB available" though, because of reserved memory).

My backup.csv file is plain text with about 5 billion entries, so it's a huge 500 GB file, with rows shaped like this (but with 64-character hexadecimal strings):

 "sdlfkjdlfkjslfjsdlfkjslrtrtykdjf";"dlksfjdrtyrylkfjlskjfssdlkfjslsdkjf"

There are only two fields in my table, and the first one is a UNIQUE index. ROW_FORMAT is set to FIXED to save space, and for the same reason the field type is set to BINARY(32).

I'm using the MyISAM engine (because InnoDB requires much more space!). (MySQL version 5.1.41)
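For reference, the table definition is roughly this (a sketch reconstructed from the description above; the column names are taken from the LOAD DATA statement below):

    -- Sketch of the table as described; hash is the unique first field
    CREATE TABLE verification (
        hash  BINARY(32) NOT NULL,
        verif BINARY(32) NOT NULL,
        UNIQUE KEY (hash)
    ) ENGINE=MyISAM ROW_FORMAT=FIXED;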

Here is the code I planned to use for now:

 ALTER TABLE verification DISABLE KEYS;
 LOCK TABLES verification WRITE;
 LOAD DATA INFILE 'G:\\backup.csv'
      IGNORE INTO TABLE verification
      FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n'
      (@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
 UNLOCK TABLES;
 ALTER TABLE verification ENABLE KEYS;

As you can see, the LOAD DATA INFILE command takes the plain-text values and UNHEXes them into binary (both fields are hexadecimal hashes in the end, so...).

I heard about the buffer sizes and all those other values in the MySQL config file. What should I change, and what would be the best values, please? As you can see, I already lock the table and disable keys to speed things up.
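To make the question concrete, this is the kind of my.cnf block I mean; the variable names are standard MySQL 5.1 MyISAM settings, but the values are only rough guesses for my ~2 GB of usable RAM, which is exactly what I am asking about:

    # my.cnf sketch -- real variable names, placeholder values
    [mysqld]
    key_buffer_size         = 256M   # MyISAM index cache
    myisam_sort_buffer_size = 512M   # used when rebuilding indexes by sorting
    bulk_insert_buffer_size = 256M   # cache used by LOAD DATA INFILE on non-empty tables
    read_buffer_size        = 2M     # per-thread sequential read buffer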

I also read in the documentation:

 myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName

Doing that before the insert would also speed it up. But what exactly is tblName? (I have a .frm file, a .MYD and a .MYI, so which one am I supposed to point it at?)

Those are the last short hints I read about optimization.

EDIT: Forgot to mention, everything is on localhost.

Upvotes: 1

Views: 4164

Answers (2)

user3916429

Reputation: 572

So, I finally managed to insert my 500 GB database of more than 3 billion entries in something like 5 hours.

I tried many ways, and while rebuilding the primary index I was stuck with this error: ERROR 1034 (HY000): Duplicate key 1 for record at 2229897540 against new record at 533925080.

I will now explain how I managed to complete my insert:

  • I sorted my .csv file with GNU CoreUtils' sort.exe (I'm on Windows). Keep in mind that doing this, you need 1.5x your csv file size as free space for temporary files (so counting the .csv file itself, it's 2.5x in total). See the sort sketch below these steps.
  • You create the table, with indexes and all.
  • Execute mysqladmin flush-tables -u a_db_user -p
  • Execute myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
  • Insert the data. (DO NOT USE ALTER TABLE tblname DISABLE KEYS; !! On MyISAM that only stops updates to non-unique indexes, so it does nothing useful for the UNIQUE index here.)

    LOCK TABLES verification WRITE;
    LOAD DATA INFILE 'G:\\backup.csv'
        IGNORE INTO TABLE verification
        FIELDS TERMINATED BY ';'
        ENCLOSED BY '"'
        LINES TERMINATED BY '\r\n'
        (@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
    UNLOCK TABLES;
  • When the data is inserted, you rebuild the indexes by executing myisamchk --key_buffer_size=1024M --sort_buffer_size=1024M -rqq /var/lib/mysql/dbName/tblName (note the -rqq: doubling the q makes it ignore possible duplicate-key errors by trying to repair them, instead of just stopping after many hours of waiting!)

  • Execute mysqladmin flush-tables -u a_db_user -p

And I was done!

  • I noticed a huge boost in speed if the .csv file is on another drive than the database, and the same for the sort operation: put the temporary files on yet another drive. (Read/write speed is much better when the data is not all in the same place.)
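For the sort in step 1, the invocation looked roughly like this (the drive letters are placeholders illustrating the "separate drives" layout, not my real paths):

    :: GNU CoreUtils sort.exe on Windows; -T puts the temporary files on
    :: another drive, -o writes the sorted output (drive letters are examples)
    sort.exe -T E:\sorttmp -o F:\backup_sorted.csv G:\backup.csv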

The source of this, again: credits go to this solution.

Upvotes: 3

Rick James

Reputation: 142238

I'm pretty sure it is verification, not verification.MYD or the other two: .MYD is the data, .MYI is the indexes, .frm is the schema.
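So the command from the question would presumably be spelled with the bare table name:

    # point myisamchk at the table name; it locates the .MYI/.MYD files itself
    myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/verification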

How long are the strings? Are they hex? If 32 hex digits, then don't you want BINARY(16) for the output of the UNHEX?
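A quick way to check the arithmetic (the hex literal here is just a made-up 32-digit value):

    -- 32 hex digits become 16 bytes after UNHEX, so BINARY(16) would hold it
    SELECT LENGTH(UNHEX('00112233445566778899aabbccddeeff'));  -- returns 16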

The long part of the process will probably be ENABLE KEYS, which is when it will be building the index. Do SHOW PROCESSLIST; while it is running. If it says "using keybuffer", kill it; it will take forever. If it says something like "building by repair", then that is good: it is sorting, then loading the index efficiently.
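Something like this, using the Id column from the output (the 12345 is a made-up id):

    SHOW PROCESSLIST;  -- watch the State column of the thread building the index
    KILL 12345;        -- made-up id; kill only if the state mentions the key buffer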

You can save 5GB of disk space by setting myisam_data_pointer_size=5 before starting the process. It seems there is also myisam_index_pointer_size, but it may default to 5, which is probably correct for your case. (I encountered that setting once on ver 4.0 in about 2004, but never again.)
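For example (a sketch; it could also go in my.cnf, and it must be set before the table is created, since it is applied at CREATE TABLE time):

    -- Shrink the per-row pointer from the 6-byte default to 5 bytes before
    -- creating the table: 1 byte saved x ~5 billion rows is about 5GB
    SET GLOBAL myisam_data_pointer_size = 5;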

I don't think key_buffer_size will matter during the load and indexing -- since you really want it not to use the key_buffer. Don't set it so high that you run out of RAM. Swapping is terrible for performance.

Upvotes: 1
