Reputation: 125855
I have many tens of thousands of rows of data that need to be inserted into a MySQL InnoDB table from a remote client. The client (Excel VBA over the MySQL ODBC connector via ADO) can either generate a CSV and perform a LOAD DATA LOCAL INFILE, or else prepare an enormous INSERT INTO ... VALUES (...), (...), ... statement and execute that.
The former requires some rather ugly hacks to overcome Excel's inability to output Unicode CSV natively (it only writes CSV in the system locale's default codepage, which in many cases is a single-byte character set and therefore quite limited). On the other hand, the MySQL documentation suggests LOAD DATA could be 20 times faster than the INSERT approach (why?), and the latter also "feels" as though it might be less stable due to the extremely long SQL command.
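For concreteness, the two candidate statements would look roughly like this (the table and column names here are placeholders, not my actual schema):

-- Option 1: bulk-load the client-side CSV file
LOAD DATA LOCAL INFILE 'C:/temp/export.csv'
INTO TABLE my_table
CHARACTER SET utf8mb4   -- or utf8, depending on server version
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(col1, col2, col3);

-- Option 2: one enormous multi-row INSERT, with tens of thousands of row tuples
INSERT INTO my_table (col1, col2, col3)
VALUES ('a1', 'b1', 'c1'),
       ('a2', 'b2', 'c2'),
       ('a3', 'b3', 'c3');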
I have not yet been able to benchmark the two approaches, but I would be very interested to hear thoughts on likely performance/stability issues.
Upvotes: 1
Views: 2470
Reputation: 160
I'm thinking maybe a hybrid solution would work well here... As in...
First, create a prepared statement for performance:
PREPARE stmt1 FROM 'INSERT INTO table (column1, column2, ...) VALUES (?, ?, ...)';
Observe that the ? marks are actual syntax - you use a question mark wherever you intend to eventually use a value parsed from the CSV file.
Write a procedure or function that opens the .CSV file and loops over its contents one record at a time, storing the parsed column values in separate variables.
Then, within this loop, just after reading a record, copy the parsed values into MySQL user variables for the prepared statement, as in...
SET @a = 3;
SET @b = 4;
There should be as many user variables as there are columns in the CSV file (and ? marks in the prepared statement); if not, you have missed something. Order is extremely important: the variables must be listed in the EXECUTE ... USING clause in the same order as the corresponding ? marks appear in the INSERT statement, so make sure they line up column for column with the columns in your INSERT.
After setting all the parameters for the prepared statement, you then execute it.
EXECUTE stmt1 USING @a, @b;
This is the end of the loop. Just after exiting the loop (after reaching the end of the CSV file), you must release the prepared statement, as in...
DEALLOCATE PREPARE stmt1;
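Putting it all together, the statements sent to the server would look something like this sketch (my_table and its columns are placeholders for your own schema, and the literal strings stand in for whatever the loop just read from the CSV):

PREPARE stmt1 FROM 'INSERT INTO my_table (column1, column2) VALUES (?, ?)';

-- repeated once per CSV record, using the values read in the loop:
SET @a = 'value read from CSV column 1';
SET @b = 'value read from CSV column 2';
EXECUTE stmt1 USING @a, @b;

-- after the last record has been processed:
DEALLOCATE PREPARE stmt1;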
Important things to keep in mind are ...
Make sure you prepare the INSERT statement before entering the loop that reads records, and make sure you DEALLOCATE the statement after exiting the loop.
Prepared statements allow the database to pre-compile and optimize the statement one time, then execute it multiple times with changing parameter values. This should result in a nice performance increase.
I am not certain about MySQL, but some databases also let you specify a number of rows to cache before a prepared statement actually executes across the network. If this is possible with MySQL, it means that although you call EXECUTE for every row read from the CSV, the database (or driver) batches up the statements until the specified number of rows is reached and only then sends them across the network. This greatly increases performance, because 5 or 10 INSERTs can be executed in a single round trip over the network instead of one per row.
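I don't know whether the MySQL ODBC driver exposes such a row-cache setting, but one way to approximate this kind of batching in plain MySQL is to prepare a statement that inserts several rows per execution, so each round trip carries more than one record. A hypothetical sketch with three rows per EXECUTE (again using placeholder table and column names):

-- three rows per EXECUTE instead of one
PREPARE stmt_batch FROM
  'INSERT INTO my_table (column1, column2) VALUES (?, ?), (?, ?), (?, ?)';

SET @a1 = 'r1c1'; SET @b1 = 'r1c2';
SET @a2 = 'r2c1'; SET @b2 = 'r2c2';
SET @a3 = 'r3c1'; SET @b3 = 'r3c2';

EXECUTE stmt_batch USING @a1, @b1, @a2, @b2, @a3, @b3;

DEALLOCATE PREPARE stmt_batch;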
Hope this helps and is relevant. Good Luck!
Rodney
Upvotes: 1