Chris Maness

Reputation: 1676

Parsing and Inserting 10 million records into SQL

So I have ~10 million records in an Excel file that have to be parsed in a specific way (I can't just convert it to CSV and insert it like that) and inserted into different tables of a MySQL database. I've gotten the run time down from all night to only a couple of hours, but I would like to decrease this even further. Does anyone have any tricks or tips that could help? I'm using Java and JDBC to parse and connect.

Upvotes: 2

Views: 3002

Answers (5)

Jon Black

Reputation: 16559

An idea...

Create a staging (temporary) database in MySQL with a table called excel_staging that matches the structure of your Excel file - use the MyISAM engine for this table.

Use load data infile to load the Excel file (saved as CSV) into the excel_staging table - it shouldn't take more than a few minutes to populate, especially as it's MyISAM.

truncate table excel_staging;

load data infile 'excel_staging.csv'
into table excel_staging
fields terminated by...
lines terminated by...
(
field1,
field2,
...
);

Write lots of select into outfile statements that extract data from the excel_staging table into individual CSV files, which you will then use to load into your individual InnoDB production database tables. You can be really creative at this point if necessary - you may even have to load extra data to support joins etc. so you can generate nicely formatted CSV output.

select distinct cust_id, name into outfile 'customers.csv' 
fields terminated by...
lines terminated by...
from
 excel_staging
order by
 cust_id; -- order for innodb import

select distinct dept_id, name into outfile 'departments.csv' 
fields terminated by...
lines terminated by...
from
 excel_staging
order by
 dept_id;

Load the nicely formatted, cleansed and ordered-by-primary-key CSV files into your production InnoDB tables using load data infile...

load data infile 'customers.csv'
into table customers
fields terminated by...
lines terminated by...
(
cust_id,
name
);

...

Excluding the time to code the solution (say 30 mins), you should be able to load into staging, output to CSV and load into the production tables in about, ermm... 6 mins end to end.

Hope this helps.

Upvotes: 2

gouki

Reputation: 4402

A few JDBC performance tips: set your connection object's autoCommit to false, but make sure to commit after a significant number of inserts (every 100K or more). Also, use and reuse a PreparedStatement object rather than a plain Statement object.
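
Something along these lines (a rough sketch only - the orders table, its columns and the row format are made up to illustrate the pattern):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BulkInsert {

    // Sketch: one reused PreparedStatement, autoCommit off,
    // committing every 100,000 rows instead of after each insert.
    static void insertAll(Connection conn, List<String[]> rows) throws Exception {
        conn.setAutoCommit(false);
        String sql = "INSERT INTO orders (cust_id, amount) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.executeUpdate();
                if (++count % 100000 == 0) {
                    conn.commit();   // flush a large chunk in one go
                }
            }
            conn.commit();           // commit whatever is left
        }
    }
}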

Upvotes: 1

joeslice

Reputation: 3464

MySQL allows you to load data from a file. Perhaps what you should do is: read 10,000 records and write them out to a file, then kick off load data infile on that file in parallel while you read the next 10,000 records (see the sketch after the list below).

So this should get you closer to a fast solution:

  1. Parallelize the read and load
  2. Instead of individual inserts, use bulk data load tools
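
A rough sketch of that pipeline (the staging table, connection URL and parser are placeholders - adjust them to your schema, and note the driver needs local infile enabled):

import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Collections;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedLoader {

    // Sketch: write 10,000-row CSV chunks and hand each one to a worker
    // thread that runs LOAD DATA LOCAL INFILE while the main thread
    // keeps parsing the next chunk.
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Iterator<String> rows = parseExcel();            // stand-in for your Excel parser
        int chunkNo = 0;
        while (rows.hasNext()) {
            String file = "chunk_" + (chunkNo++) + ".csv";
            try (FileWriter out = new FileWriter(file)) {
                for (int i = 0; i < 10000 && rows.hasNext(); i++) {
                    out.write(rows.next());
                    out.write('\n');
                }
            }
            pool.submit(() -> loadChunk(file));          // load in parallel with parsing
        }
        pool.shutdown();
    }

    static void loadChunk(String file) {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
             Statement st = conn.createStatement()) {
            st.execute("LOAD DATA LOCAL INFILE '" + file +
                       "' INTO TABLE excel_staging FIELDS TERMINATED BY ','");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static Iterator<String> parseExcel() {
        return Collections.emptyIterator();              // replace with your real parser
    }
}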

Upvotes: 5

Chris Eberle

Reputation: 48795

Make sure to disable foreign key checks while you're doing your inserts (this only affects InnoDB); it gives a pretty drastic speedup. And then, of course, re-enable foreign key checks when you're done.
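
From JDBC that's just a couple of statements wrapped around the load, roughly like this (sketch only - the bulk-load step itself is passed in as a Runnable here):

import java.sql.Connection;
import java.sql.Statement;

public class FkToggle {

    // Sketch: disable foreign key checks for this session, run the bulk
    // load, then switch them back on. Only affects InnoDB tables.
    static void loadWithoutFkChecks(Connection conn, Runnable bulkLoad) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("SET foreign_key_checks = 0");   // skip FK validation during the load
            bulkLoad.run();                             // your inserts / LOAD DATA go here
            st.execute("SET foreign_key_checks = 1");   // turn validation back on
        }
    }
}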

Upvotes: 1

Justin Thomas

Reputation: 5848

Look into using executeBatch and doing blocks of 1000 or so. That will help a lot.
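
Roughly like this (a sketch - the items table and its columns are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {

    // Sketch: queue rows with addBatch() and flush every 1000 rows
    // with a single executeBatch() round trip.
    static void insertBatched(Connection conn, List<String[]> rows) throws Exception {
        String sql = "INSERT INTO items (id, name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();   // send 1000 inserts at once
                }
            }
            ps.executeBatch();           // flush the final partial batch
        }
    }
}

If the MySQL driver supports it, adding rewriteBatchedStatements=true to the connection URL should help even more, since the driver can then rewrite each batch into multi-row INSERT statements.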

Upvotes: 3
