jskidd3

Reputation: 4783

Drastically speed up 10 million row insert in MySQL database

$files = glob('dataset/*.xml');

foreach ($files as $key => $txc) {
    $txcDoc = new DOMDocument();
    $txcDoc->load($txc);

    $operators = $txcDoc->getElementsByTagName("Operators");
    foreach ($operators as $operatorTag) {
        foreach ($operatorTag->getElementsByTagName("Operator") as $operator) {
            $reference = $operator->getAttribute("id");
            @$nationalOperatorCode = $operator->getElementsByTagName("NationalOperatorCode")->item(0)->nodeValue;
            $operatorCode = $operator->getElementsByTagName("OperatorCode")->item(0)->nodeValue;
            $operatorShortName = $operator->getElementsByTagName("OperatorShortName")->item(0)->nodeValue;
            @$operatorNameOnLicense = $operator->getElementsByTagName("OperatorNameOnLicense")->item(0)->nodeValue;
            @$tradingName = $operator->getElementsByTagName("TradingName")->item(0)->nodeValue;

            $operatorSQL = "INSERT IGNORE INTO `operator` (`reference`, `national_operator_code`, `operator_code`, `operator_short_name`, `operator_name_on_license`, `trading_name`) VALUES (:reference, :nationalOperatorCode, :operatorCode, :operatorShortName, :operatorNameOnLicense, :tradingName);";

            $operatorStmt = $conn->prepare($operatorSQL);
            $operatorStmt->execute(array(':reference' => $reference, ':nationalOperatorCode' => $nationalOperatorCode, ':operatorCode' => $operatorCode, ':operatorShortName' => $operatorShortName, ':operatorNameOnLicense' => $operatorNameOnLicense, ':tradingName' => $tradingName));
        }
    }
}

The PHP above cycles through 78,654 XML files (1.2 GB), parses their data and then inserts it into a MySQL database. The snippet above is only about a tenth of the full script, however; there are another 10-15 foreach constructs just like the foreach ($operators one.

My issue is that it's taking 10-20 minutes to insert 250 files' worth of data. I need to speed things up drastically so that all the data is inserted in under 1-2 hours.

The database engine is MySQL and the tables are all InnoDB. How can I go about speeding these inserts up?

Upvotes: 1

Views: 776

Answers (2)

Mikk

Reputation: 2229

You don't need to call $conn->prepare($operatorSQL); in every loop iteration: prepare the statement once, outside the loop, and only execute it inside. I don't know exactly how much of an improvement you'll get, but with millions of rows it makes sense to refactor it this way, as sketched below.
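A minimal sketch of that refactor, using the operator statement from the question (the other prepared statements in the full script would be hoisted the same way):

    // Prepare once, before the file loop.
    $operatorSQL = "INSERT IGNORE INTO `operator` (`reference`, `national_operator_code`,
        `operator_code`, `operator_short_name`, `operator_name_on_license`, `trading_name`)
        VALUES (:reference, :nationalOperatorCode, :operatorCode, :operatorShortName,
        :operatorNameOnLicense, :tradingName)";
    $operatorStmt = $conn->prepare($operatorSQL);

    foreach ($files as $txc) {
        // ... DOM parsing exactly as in the question ...

        // Only execute the already-prepared statement per operator.
        $operatorStmt->execute(array(
            ':reference'             => $reference,
            ':nationalOperatorCode'  => $nationalOperatorCode,
            ':operatorCode'          => $operatorCode,
            ':operatorShortName'     => $operatorShortName,
            ':operatorNameOnLicense' => $operatorNameOnLicense,
            ':tradingName'           => $tradingName,
        ));
    }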

Also, the @ operator (a.k.a. the error-suppression operator) is known to slow PHP scripts down.

The code itself may be fine, but if you are using a remote SQL server it could well be a bandwidth problem.

You might also consider not using PHP at all; Java, for example, would be a lot faster at parsing this much data, and it lets you write a multithreaded program that parses several files at a time.

Last but not least, you can upgrade the hardware. Reading tens of thousands of files is heavy on disk I/O, and using an SSD instead of an HDD can speed things up.

Like others have mentioned, you should actually benchmark your code. Xdebug is a well-known PHP tool that can generate profiling information; based on it you can find the bottlenecks in your code, and once you actually know where the problem is you can weigh the options above.
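Even without a full profiler, a crude timing split tells you whether parsing or inserting dominates. A minimal sketch wrapped around the loop from the question:

    // Accumulate time spent parsing vs. inserting across all files.
    $parseTime = $insertTime = 0.0;

    foreach ($files as $txc) {
        $t = microtime(true);
        $txcDoc = new DOMDocument();
        $txcDoc->load($txc);
        $parseTime += microtime(true) - $t;

        $t = microtime(true);
        // ... run the INSERTs for this file ...
        $insertTime += microtime(true) - $t;
    }

    printf("parse: %.1fs, insert: %.1fs\n", $parseTime, $insertTime);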

Upvotes: 2

arkascha

Reputation: 42915

Obviously there are many things to look at, and you don't offer many details...

However, one thing that typically speeds up such mass inserts considerably is:

  1. remove all indexes defined on the table(s) where data is inserted into

  2. insert the data

  3. recreate all indexes as defined before

The reason this speeds things up is that the indexes have to be reorganized and written only once, instead of on each and every single insert operation. I was often surprised how much difference that made...
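A rough sketch of that pattern against the operator table from the question. The index name idx_operator_code is hypothetical, and note the caveat: only drop secondary indexes — the primary key and any unique index that INSERT IGNORE relies on to skip duplicates must stay in place.

    // Drop a (hypothetical) secondary index before the bulk insert.
    // Keep the primary key and unique indexes: INSERT IGNORE needs
    // them to detect duplicates.
    $conn->exec("ALTER TABLE `operator` DROP INDEX `idx_operator_code`");

    // ... run the whole insert loop here ...

    // Recreate the index once, after all rows are in.
    $conn->exec("ALTER TABLE `operator` ADD INDEX `idx_operator_code` (`operator_code`)");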

If, in addition, you really want to tune your PHP implementation, it makes sense to run an example through a profiler to understand where exactly the time is spent. Concentrate on the parts that really stick out, but keep in mind that there is no point investing endless time in perfectionism: having a CPU do the work is much cheaper than wasting your own time :-)

Upvotes: 2
