BostonPHPGuy

Reputation: 21

What is the best way to import a 1.4 million line, 800MB tab-delimited file?

I have an 800MB tab-delimited file with 1.4 million records that I need to get into a MySQL database. Before I run anything that cripples my local machine, I was wondering what the best approach is. I'll need to do this every 4-6 weeks when I get a new file with updated data.

I've been thinking one of the following may be best, but am not sure:

Option 1

$input = file('data-file.tab', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$filedata = array();

foreach ($input as $line) {
    $data = explode("\t", $line);
    // function to import data to MySQL
}

Option 2

Split up the file into manageable chunks and import via phpMyAdmin.

Upvotes: 0

Views: 629

Answers (4)

Jeff Muzzy

Reputation: 94

The short answer is you need to worry about both your machine and the MySQL server you are going to import the records into.

You'll want to make an optimization pass over your MySQL configuration (including restarting mysqld so the my.cnf changes take effect) before trying the approach below.

Once you have done that, look at MySQL's LOAD DATA INFILE command.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Example:

LOAD DATA INFILE
  '/tmp/test.txt'
INTO TABLE
  testTable
FIELDS TERMINATED BY
  '\t'   -- the question's file is tab-delimited
ENCLOSED BY
  '"'
LINES TERMINATED BY
  '\r\n'  -- Windows created file
IGNORE
  1 LINES
(col1, col2, col3, col4, col5...)

Upvotes: 0

Havenard

Reputation: 27854

Because the file is 800MB, you certainly don't want to load it all into memory at once, so file() is not an option. Instead, you should read it line by line.

You can do that in two basic ways. Either way, first open a file handle for reading:

$fd = fopen('data-file.tab', 'r');

Now you have two options to iterate through the lines of this file:

Option 1: Use fgets() to read each line. It will include the line delimiter, so remember to trim() it before explode().

Example:

while ($line = fgets($fd))
{
    $line = explode("\t", trim($line));
    // do what you must with it
}

Option 2: Is it simply tab-separated, or a CSV-like file with quoting? In the latter case, you might consider using fgetcsv().

while ($line = fgetcsv($fd, 0, "\t"))
{
    // $line is already ready and formatted, do what you must with it
}

Additional performance considerations

The memory problem is solved, but inserting those records into MySQL can be extremely time-consuming, especially if you are going to run the queries directly from PHP. There are a few ways to speed that up.

First, don't INSERT each row separately. Instead of generating an INSERT INTO table... for each record, consider buffering the records and inserting them in bundles of 100 or more per INSERT query. This will speed things up A LOT.
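A minimal sketch of that batching, assuming a mysqli connection and a made-up table imported_data with three columns (adjust names, credentials and escaping to your own schema):

$db = new mysqli('localhost', 'user', 'password', 'database');

$fd = fopen('data-file.tab', 'r');
$batch = array();

while ($line = fgetcsv($fd, 0, "\t")) {
    // escape every field and build one "(v1,v2,v3)" tuple per record
    $batch[] = "('" . implode("','", array_map(array($db, 'real_escape_string'), $line)) . "')";

    // flush every 100 records as a single multi-row INSERT
    if (count($batch) >= 100) {
        $db->query('INSERT INTO imported_data (col1, col2, col3) VALUES ' . implode(',', $batch));
        $batch = array();
    }
}

// flush whatever is left over
if ($batch) {
    $db->query('INSERT INTO imported_data (col1, col2, col3) VALUES ' . implode(',', $batch));
}

fclose($fd);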

For additional speed, you can outsource the database part to a parallel process, in this case the mysql console client itself. Instead of running the queries you generate with a MySQL extension like mysqli_*, you can simply echo them or send them to a mysql process created with popen().

Example:

$ php generate_queries.php | mysql -u user -ppassword database

Or doing it all in PHP code with popen():

$mysql = popen('mysql -u user -ppassword database', 'w');
...
fwrite($mysql, 'INSERT INTO table...');
...
pclose($mysql);  // close the pipe once all queries have been written

Upvotes: 0

Bill Karwin

Reputation: 562280

I wouldn't do this through phpMyAdmin. I know that may be the more familiar interface for you, but if you're managing industrial-sized datasets, you should get used to tools that are more efficient and reliable.

Go to the command line and use mysqlimport. This is the command-line interface to the LOAD DATA statement. You can either run it on a client host and import remotely, or upload the file to the database server and import locally.

To split up the file, I'd use pt-fifo-split. This makes it easy to write a shell loop over chunks of the file, without any need to physically split the file into multiple smaller files.
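A rough sketch of both from the shell, assuming a database and user, a table named data_file, a tab-delimited input, and a chunk size picked for illustration (check your mysqlimport and pt-fifo-split versions for the exact options):

# mysqlimport is the command-line wrapper around LOAD DATA; the input file's
# base name must match the target table (here: data_file)
mysqlimport --local --ignore-lines=1 \
  --fields-terminated-by='\t' --lines-terminated-by='\n' \
  -u user -p database /path/to/data_file.txt

# pt-fifo-split exposes the big file chunk by chunk through a FIFO
# (default /tmp/pt-fifo-split), so each load only handles part of it
pt-fifo-split --lines 100000 /path/to/data_file.txt &
while [ -e /tmp/pt-fifo-split ]; do
  mysql -u user -p database \
    -e "LOAD DATA LOCAL INFILE '/tmp/pt-fifo-split' INTO TABLE data_file FIELDS TERMINATED BY '\t'"
done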

Upvotes: 3

Step 1: use the right tools for the job. If you want data imported into your database, turn the data into db-readable form, and then import it into your database the obvious way (i.e. using the database's import functions).

You can use PHP to write a tiny script that runs through the data file line by line, converts it to the SQL form you need, and writes it to an out.sql file, which you can then import normally into your DB. Done. For bonus points, wrap out.sql in a transaction start/end to make the import a million times faster.
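A minimal PHP sketch of that generator, assuming a made-up table my_table with three columns (addslashes() here stands in for whatever escaping your data actually needs):

$in  = fopen('data-file.tab', 'r');
$out = fopen('out.sql', 'w');

// one transaction around the whole import keeps the bulk insert fast
fwrite($out, "START TRANSACTION;\n");

while ($line = fgetcsv($in, 0, "\t")) {
    $values = "'" . implode("','", array_map('addslashes', $line)) . "'";
    fwrite($out, "INSERT INTO my_table (col1, col2, col3) VALUES ($values);\n");
}

fwrite($out, "COMMIT;\n");
fclose($in);
fclose($out);

// then import it the normal way:  mysql -u user -p database < out.sql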

Upvotes: 1
