Demian

Reputation: 372

PHP - Parsing large XML files and removing duplicate nodes

I have an application written in Symfony 1. Every day, the system administrators upload an XML file with a list of their new clients.

Just recently, they lost a lot of the database records containing information about their clients.

The good news is that the information was also stored somewhere else (besides their main database), so they were able to generate an XML file containing data about all their clients.

The problem is that this XML file is not only huge (about 25 MB) but also full of repeated nodes.

They are used to uploading much smaller files (around 300 KB in size), since they normally upload them on a daily basis. Processing this larger file exceeds PHP's default 30-second execution limit, causing the script to stop mid-run.

I would like to know if there is a fast way of processing this XML file and removing all the duplicate nodes using PHP code.

The XML has the following format:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <Clientes>
      <Cliente>
        <clienteCodigo>1Z</clienteCodigo>
        <nombreCliente>COMPANY NAME</nombreCliente>
        <telefonoCliente>011-4444-4555</telefonoCliente>
      </Cliente>
      <Cliente>
        <clienteCodigo>1Z</clienteCodigo>
        <nombreCliente>COMPANY NAME</nombreCliente>
        <telefonoCliente>011-4444-4555</telefonoCliente>
      </Cliente>
      <Cliente>
        <clienteCodigo>2A</clienteCodigo>
        <nombreCliente>COMPANY NAME 2</nombreCliente>
        <telefonoCliente>011-8888-4646</telefonoCliente>
      </Cliente>
      <Cliente>
        <clienteCodigo>2A</clienteCodigo>
        <nombreCliente>COMPANY NAME 2</nombreCliente>
        <telefonoCliente>011-8888-4646</telefonoCliente>
      </Cliente>
    </Clientes>
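
For reference, a minimal sketch of stripping the duplicate Cliente nodes by clienteCodigo with DOMDocument (the file paths are placeholders; note that DOMDocument loads the whole document into memory, which should still be workable at roughly 25 MB):

    <?php
    // Minimal dedup sketch: keep the first <Cliente> per clienteCodigo.
    $dom = new DOMDocument();
    $dom->load('clientes.xml'); // placeholder path

    // Copy the live node list into a plain array so removals are safe.
    $nodes = array();
    foreach ($dom->getElementsByTagName('Cliente') as $node) {
        $nodes[] = $node;
    }

    $seen = array();
    foreach ($nodes as $cliente) {
        $codigo = $cliente->getElementsByTagName('clienteCodigo')->item(0)->nodeValue;
        if (isset($seen[$codigo])) {
            $cliente->parentNode->removeChild($cliente); // duplicate: drop it
        } else {
            $seen[$codigo] = true;
        }
    }

    $dom->save('clientes_dedup.xml'); // placeholder path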

Thanks in advance.

EDIT: This is the code that currently loops through the elements (I did not write it).

    foreach ($clientes->getElementsByTagName("Cliente") as $clienteElement)
    {
        // Look the client up by its code: one SELECT per <Cliente> node
        $clienteNew = Doctrine::getTable('Cliente')->findOneByCliente_codigo($clienteElement->getElementsByTagName("clienteCodigo")->item(0)->nodeValue);

        if (!$clienteNew)
        {
            // Not found: create a new record with this code
            $clienteNew = new Cliente();
            $clienteNew->cliente_codigo = $clienteElement->getElementsByTagName("clienteCodigo")->item(0)->nodeValue;
        }

        $clienteNew->nombre = $clienteElement->getElementsByTagName("nombreCliente")->item(0)->nodeValue;
        $clienteNew->telefono = $clienteElement->getElementsByTagName("telefonoCliente")->item(0)->nodeValue;
        $clienteNew->save(); // one INSERT or UPDATE round trip per node
    }

Upvotes: 1

Views: 830

Answers (1)

Udan

Reputation: 5609

Although I'm a big fan of Symfony and Doctrine... for that amount of inserts, updates, and checks, the fastest way is something like this, using PHP's PDO:

  1. Set a unique index on the column you want to be unique and that your current code checks for (the ALTER TABLE statement is shown in the filled-in sketch below).

  2. The pseudo code:

    $conn = new \PDO("mysql:host= .... your connection string here");
    $conn->beginTransaction(); // a single transaction speeds up bulk inserts a lot
    $sql = "INSERT INTO tablename (field list) VALUES (parameters matching field list) ON DUPLICATE KEY UPDATE field=param pairs";
    $q = $conn->prepare($sql); // prepare once, execute once per node

    foreach ($clientes->getElementsByTagName("Cliente") as $clienteElement) {
        $q->execute(array(param => xml_value));
    }

    $conn->commit();


This way you can process a lot more data than with the code you are working with now.
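
Filled in with the field names from the question, a runnable version of that sketch could look like this (the table name cliente and the columns cliente_codigo, nombre, and telefono are assumptions based on the Doctrine model above, as is the connection string):

    <?php
    // Sketch only: table and column names are assumed from the question's
    // Doctrine model -- adjust them and the DSN to the real schema.
    // One-time setup so ON DUPLICATE KEY UPDATE can fire:
    //   ALTER TABLE cliente ADD UNIQUE (cliente_codigo);
    $conn = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder DSN
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $conn->beginTransaction();

    $q = $conn->prepare(
        'INSERT INTO cliente (cliente_codigo, nombre, telefono)
         VALUES (:codigo, :nombre, :telefono)
         ON DUPLICATE KEY UPDATE nombre = VALUES(nombre), telefono = VALUES(telefono)'
    );

    foreach ($clientes->getElementsByTagName('Cliente') as $clienteElement) {
        $q->execute(array(
            ':codigo'   => $clienteElement->getElementsByTagName('clienteCodigo')->item(0)->nodeValue,
            ':nombre'   => $clienteElement->getElementsByTagName('nombreCliente')->item(0)->nodeValue,
            ':telefono' => $clienteElement->getElementsByTagName('telefonoCliente')->item(0)->nodeValue,
        ));
    }

    $conn->commit();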

If this doesn't help... then you might consider extending the time PHP scripts are allowed to run: http://php.net/function.set-time-limit
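
For example, at the top of the import script (the limit value here is arbitrary):

    set_time_limit(300); // allow this one import to run for up to 5 minutes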

Upvotes: 1
