pavel

Reputation: 27082

How to process CSV with 100k+ lines in PHP?

I have a CSV file with more than 100,000 lines; each line has 3 values separated by a semicolon. The total file size is approx. 5 MB.

The CSV file is in this format:

stock_id;product_id;amount
==========================
1;1234;0
1;1235;1
1;1236;0
...
2;1234;3
2;1235;2
2;1236;13
...
3;1234;0
3;1235;2
3;1236;0
...

We have 10 stocks, indexed 1-10 in the CSV. In the database they are saved as 22-31.

The CSV is sorted by stock_id and product_id, but I don't think that matters.

What I have

<?php

session_start();

require_once ('db.php');

echo '<meta charset="iso-8859-2">';

// convert table: `CSV stock id => DB stock id`
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$products = array(); // products present in the database, keyed by product_id

$sql = $mysqli->query("SELECT product_id FROM table WHERE fielddef_id = 1");

while ($row = $sql->fetch_assoc()) {
    $products[$row['product_id']] = 1;
}

$csv = file('export.csv');

// go thru CSV file and prepare SQL UPDATE query
foreach ($csv as $row) {
    $data = explode(';', $row);
    // $data[0] - stock_id
    // $data[1] - product_id
    // $data[2] - amount

    if (isset($products[$data[1]])) {
        // the CSV contains products which aren't in the database
        // the echo below should show me the queries
        echo "  UPDATE t 
                SET value = " . (int)$data[2] . " 
                WHERE   fielddef_id = " . (int)$stocks[$data[0]] . " AND 
                        product_id = '" . $data[1] . "' -- product_id isn't just numeric
                LIMIT 1<br>";
    }
}

The problem is that writing out 100k lines with echo is so slow that it takes long minutes. I'm not sure what MySQL will do, whether it will be faster or take roughly the same time. I have no testing machine here, so I'm worried about testing it on the production server.

My idea was to load the CSV file into several variables (or better, an array) in chunks like below, but I don't know whether it would actually help.

$csv[0] = lines 0      - 10.000;
$csv[1] = lines 10.001 - 20.000;
$csv[2] = lines 20.001 - 30.000;
$csv[3] = lines 30.001 - 40.000;
etc. 
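
Just to show what I mean, the splitting itself would presumably be something like this (untested sketch):

$csv = array_chunk(file('export.csv'), 10000);
// $csv[0] holds lines 0 - 9,999, $csv[1] holds lines 10,000 - 19,999, etc.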

I found e.g. Efficiently counting the number of lines of a text file. (200mb+), but I'm not sure how it can help me.

When I replace the foreach with print_r, I get the dump in less than 1 second. So the task is to make the foreach loop with the database update faster.

Any ideas how to update so many records in the database?
Thanks.

Upvotes: 3

Views: 4396

Answers (5)

Anthony

Reputation: 37045

Like I said in the comment, use SplFileObject to iterate over the CSV file. Use prepared statements to reduce the performance overhead of calling the UPDATE in each loop. Also, merge your two queries together; there isn't any reason to pull all of the product rows first and check them against the CSV. You can use a JOIN so that only the stock rows in the second table that are related to the product in the first table and match the current CSV row get updated:

/* First the CSV is pulled in */
$export_csv = new SplFileObject('export.csv');
$export_csv->setFlags(SplFileObject::READ_CSV | SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD);
$export_csv->setCsvControl(';');

/* Next you prepare your statement object */
$stmt = $mysqli->prepare("
UPDATE stocks, products 
SET value = ?
WHERE
stocks.fielddef_id = ? AND 
product_id = ? AND
products.fielddef_id = 1
LIMIT 1
");

$stmt->bind_param('iis', $amount, $fielddef_id, $product_id);

/* Now you can loop through the CSV and set the fields to match the integers bound to the prepared statement and execute the update on each loop. */

foreach ($export_csv as $csv_row) {
    list($stock_id, $product_id, $amount) = $csv_row;
    $fielddef_id = $stock_id + 21;

    if(!empty($stock_id)) {
        $stmt->execute();
    }
}

$stmt->close();

Upvotes: 2

pavel

Reputation: 27082

Thanks to the answers and comments on the question, I have the solution. The base for it comes from @Dave's answer; I've only updated it to fit the question better.

<?php

require_once 'include.php';

// stock convert table (key is ID in CSV, value ID in database)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

// product IDs in the CSV (value) and in the database (product_id) differ, so take both IDs from the database and build a map of e-shop products
$sql = mysql_query("SELECT product_id, value FROM cms_module_products_fieldvals WHERE fielddef_id = 1") or die(mysql_error());

while ($row = mysql_fetch_assoc($sql)) {
    $products[$row['value']] = $row['product_id'];
}

$handle = fopen('import.csv', 'r');
$i = 1;

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = (int)$products[$data[1]];

    if ($p_id > 0) {
        // only continue if the product exists in the database. It works without this condition too, but we'd send many useless queries (... WHERE product_id = 0 updates nothing, yet still takes time)
        if ($i % 300 === 0) {
            // optional, we'll see what it do with the real traffic
            sleep(1);
        }

        $updatesql = "UPDATE table SET value = " . (int)$data[2] . " WHERE fielddef_id = " . $stocks[$data[0]] . " AND product_id = " . (int)$p_id . " LIMIT 1";
        echo "$updatesql<br>";//for debug only comment out on live
        $i++;
    }
}

// approx. 1.5 sec to loop through 100,000+ records
fclose($handle);

Upvotes: 3

Edson Medina

Reputation: 10269

Updating every record every time will be too expensive (mostly due to seeks, but also from writing).

You should TRUNCATE the table first and then insert all the records again (assuming you won't have external foreign keys linking to this table).

To make it even faster, you should lock the table before the insert and unlock it afterwards. This will prevent the indexing from happening at every insert.
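
For illustration, a rough sketch of that approach (untested, and only a sketch: it reuses the $stocks map and $mysqli connection from the question as well as the table/column names there, and assumes wiping and re-filling the table is acceptable):

// wipe the table, lock it, and re-insert everything in bulk
$mysqli->query("TRUNCATE TABLE t");    // assumes no foreign keys point at t
$mysqli->query("LOCK TABLES t WRITE"); // per the note above: keeps index maintenance out of the per-insert path

$handle = fopen('export.csv', 'r');
$rows = array();

while (($data = fgetcsv($handle, 1000, ';')) !== false) {
    $rows[] = "(" . (int)$stocks[$data[0]] . ", '" . $mysqli->real_escape_string($data[1]) . "', " . (int)$data[2] . ")";
}
fclose($handle);

// a few multi-row INSERTs instead of 100k single-row statements
foreach (array_chunk($rows, 1000) as $chunk) {
    $mysqli->query("INSERT INTO t (fielddef_id, product_id, value) VALUES " . implode(',', $chunk));
}

$mysqli->query("UNLOCK TABLES");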

Upvotes: 0

Dave

Reputation: 3288

Something like this (please note this is 100% untested and off the top of my head, so it may need some tweaking to actually work :) ):

// define the mapping array (there are probably better ways of doing this)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$handle = fopen("file.csv", "r")); //open file
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
    //loop through csv

    $updatesql = "UPDATE t SET `value` = ".$data[2]." WHERE   fielddef_id = ".$stocks[$data[0]]." AND product_id = ".$data[1];
   echo "$updatesql<br>";//for debug only comment out on live
}

There is no need to do your initial SELECT, since you're only ever setting your product data to 1 anyway in your code, and it looks from your description that your product IDs are always correct; it's just your fielddef column which needs the map.

Also, for live use, don't forget to actually execute your $updatesql against mysqli.
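
e.g. (a sketch; this assumes $mysqli is the connection object from your db.php):

if (!$mysqli->query($updatesql)) {
    echo "Update failed: " . $mysqli->error . "<br>";
}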

To give you a comparison with actual usage code (that I can benchmark against!), this is some code I use for an importer of an uploaded file (it's not perfect, but it does its job):

if (isset($_POST['action']) && $_POST['action']=="beginimport") {
            echo "<h4>Starting Import</h4><br />";
            // Ignore user abort and expand time limit 
            //ignore_user_abort(true);
            set_time_limit(60);
                if (($handle = fopen($_FILES['clientimport']['tmp_name'], "r")) !== FALSE) {
                    $row = 0;
                    //defaults 
                    $sitetype = 3;
                    $sitestatus = 1;
                    $startdate = "2013-01-01 00:00:00";
                    $enddate = "2013-12-31 23:59:59";
                    $createdby = 1;
                    //loop and insert
                    while (($data = fgetcsv($handle, 10000, ",")) !== FALSE) {  // loop through each line of CSV. Returns array of that line each time so we can hard reference it if we want.
                        if ($row>0) {
                            if (strlen($data[1])>0) {
                                $clientshortcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0])));
                                $sitename = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0]))." ".trim(stripslashes($data[1])));
                                $address = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[1])).",".trim(stripslashes($data[2])).",".trim(stripslashes($data[3])));
                                $postcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[4])));
                                //look up client ID
                                $client = $db->queryUniqueObject("SELECT ID FROM tblclients WHERE ShortCode='$clientshortcode'",ENABLE_DEBUG);

                                if ($client->ID>0 && is_numeric($client->ID)) {
                                    //got client ID so now check if site already exists we can trust the site name here since we only care about double matching against already imported sites.
                                    $sitecount = $db->countOf("tblsites","SiteName='$sitename'");
                                    if ($sitecount>0) {
                                        //site exists
                                        echo "<strong style=\"color:orange;\">SITE $sitename ALREADY EXISTS SKIPPING</strong><br />";
                                    } else {
                                        //site doesn't exist so do import
                                        $db->execute("INSERT INTO tblsites (SiteName,SiteAddress,SitePostcode,SiteType,SiteStatus,CreatedBy,StartDate,EndDate,CompanyID) VALUES 
                                        ('$sitename','$address','$postcode',$sitetype,$sitestatus,$createdby,'$startdate','$enddate',".$client->ID.")",ENABLE_DEBUG);
                                        echo "IMPORTED - ".$data[0]." - ".$data[1]."<br />";
                                    }
                                } else {
                                    echo "<strong style=\"color:red;\">CLIENT $clientshortcode NOT FOUND PLEASE ENTER AND RE-IMPORT</strong><br />";
                                }
                                fcflush();
                                set_time_limit(60); // reset timer on loop
                            }
                        } else {
                            $row++;
                        }
                    } 
                    echo "<br />COMPLETED<br />";
                }
                fclose($handle);
                unlink($_FILES['clientimport']['tmp_name']);
            echo "All Imports finished do not reload this page";
        }

That imported 150k rows in about 10 seconds

Upvotes: 3

Seer

Reputation: 5237

Make the query bigger, i.e. use the loop to compile one larger query. You may need to split it up into chunks (e.g. process 100 rows at a time), but certainly don't do one query at a time (this applies to any kind: insert, update, even select where possible). This should greatly increase the performance.

It's generally recommended that you don't query in a loop.
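
For example, a sketch of one way to compile bigger statements here (untested): it reuses the $stocks map, the $mysqli connection and the table/column names from the question, and assumes there is a unique key on (fielddef_id, product_id), so that a multi-row INSERT ... ON DUPLICATE KEY UPDATE behaves like a batched update.

// batch the CSV rows into multi-row statements instead of one query per row
$stocks = array(1 => 22, 2 => 23, 3 => 24, 4 => 25, 5 => 26,
                6 => 27, 7 => 28, 8 => 29, 9 => 30, 10 => 31);

$sql_head = "INSERT INTO t (fielddef_id, product_id, value) VALUES ";
$sql_tail = " ON DUPLICATE KEY UPDATE value = VALUES(value)";

$handle = fopen('export.csv', 'r');
$values = array();

while (($data = fgetcsv($handle, 1000, ';')) !== false) {
    $values[] = "(" . (int)$stocks[$data[0]] . ", '" . $mysqli->real_escape_string($data[1]) . "', " . (int)$data[2] . ")";

    if (count($values) === 100) { // flush a chunk of 100 rows
        $mysqli->query($sql_head . implode(',', $values) . $sql_tail);
        $values = array();
    }
}

if ($values) { // flush the remainder
    $mysqli->query($sql_head . implode(',', $values) . $sql_tail);
}

fclose($handle);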

Upvotes: 0
