Quantum

Reputation: 1476

find xml block and replace all, regex to match and call back to overwrite

I just need to do a quick match and replace on content that comes from an XML file. I don't want to parse the file, since it is about 100MB and I can't change that. Here is the sample data.

     <?xml version="1.0" encoding="UTF-8"?>
    <products>
        <product active="1" on_sale="0" discountable="0">
            <sku>SKUTARGET</sku>
            <name><![CDATA[sdfsdf (NET)]]></name>
            <description><![CDATA[agag adgsgsdg asdgsdg]]></description>
            <keywords></keywords>
            <price>9.000000</price>
            <stock_quantity>35</stock_quantity>
            <reorder_quantity>0</reorder_quantity>
            <height>0.000000</height>
            <length>0.000000</length>
            <diameter>0.000000</diameter>
            <weight>0.000000</weight>
            <color>Black</color>
            <material>PVC</material>
            <barcode>883045010070</barcode>
            <release_date>2008-11-10</release_date>
            <images>
                <image>/sdssd/sdfsd.jpg</image>
                <image>/AL10sdfsds07XO/sdfsd.jpg</image>
            </images>
            <categories>
                <category code="166" video="0" parent="172">sd &amp; Sexy sdf</category>
                <category code="172" video="0" parent="">sd &amp; dddsdsds</category>
                <category code="641" video="0" parent="172">sdfsdf Costume sdfsdfsdf</category>
            </categories>
            <manufacturer code="AL" video="0">sdfsdf sdfs</manufacturer>
            <type code="LI" video="0">sdfsd</type>
        </product>
        <product active="1" on_sale="0" discountable="0">
            <sku>XXXXXXX</sku>
            <name><![CDATA[LEATHER sdfsdf (NET)]]></name>
            <description><![CDATA[asdgsdgsd sad sadg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg]]></description>
            <keywords></keywords>
            <price>5.000000</price>
            <stock_quantity>36</stock_quantity>
            <reorder_quantity>0</reorder_quantity>
            <height>0.000000</height>
            <length>0.000000</length>
            <diameter>0.000000</diameter>
            <weight>0.000000</weight>
            <color>Black</color>
            <material>Leather</material>
            <barcode>883045300164</barcode>
            <release_date>2008-11-10</release_date>
            <images>
                <image>/AL10sds0XO/sdsdsd.jpg</image>
                <image>/sdsds/AL1sd00XOB.jpg</image>
                <image>/AL1sdsds00XO/sdsds.jpg</image>
            </images>
            <categories>
                <category code="80" video="0" parent="44">sdgsdgsdg</category>
                <category code="181" video="0" parent="172">Sleep &amp; Lounge</category>
            </categories>
            <manufacturer code="AL" video="0">Allure sdsds</manufacturer>
            <type code="LI" video="0">sdsfsdfsd</type>
        </product>
    </products>

What I need is just the one block, starting at the product node, where the sku matches a variable (in this case "SKUTARGET"):

        <product active="1" on_sale="0" discountable="0">
            <sku>SKUTARGET</sku>
            <name><![CDATA[sdfsdf (NET)]]></name>
            <description><![CDATA[agag adgsgsdg asdgsdg]]></description>
            <keywords></keywords>
            <price>9.000000</price>
            <stock_quantity>35</stock_quantity>
            <reorder_quantity>0</reorder_quantity>
            <height>0.000000</height>
            <length>0.000000</length>
            <diameter>0.000000</diameter>
            <weight>0.000000</weight>
            <color>Black</color>
            <material>PVC</material>
            <barcode>883045010070</barcode>
            <release_date>2008-11-10</release_date>
            <images>
                <image>/sdssd/sdfsd.jpg</image>
                <image>/AL10sdfsds07XO/sdfsd.jpg</image>
            </images>
            <categories>
                <category code="166" video="0" parent="172">sd &amp; Sexy sdf</category>
                <category code="172" video="0" parent="">sd &amp; dddsdsds</category>
                <category code="641" video="0" parent="172">sdfsdf Costume sdfsdfsdf</category>
            </categories>
            <manufacturer code="AL" video="0">sdfsdf sdfs</manufacturer>
            <type code="LI" video="0">sdfsd</type>
        </product>

Here is the code I'm working with at the moment

    <?php

    ob_start();
    ?> 
    <?xml version="1.0" encoding="UTF-8"?>
    <products>
        <product active="1" on_sale="0" discountable="0">
            <sku>SKUTARGET</sku>
            <name><![CDATA[sdfsdf (NET)]]></name>
            <description><![CDATA[agag adgsgsdg asdgsdg]]></description>
            <keywords></keywords>
            <price>9.000000</price>
            <stock_quantity>35</stock_quantity>
            <reorder_quantity>0</reorder_quantity>
            <height>0.000000</height>
            <length>0.000000</length>
            <diameter>0.000000</diameter>
            <weight>0.000000</weight>
            <color>Black</color>
            <material>PVC</material>
            <barcode>883045010070</barcode>
            <release_date>2008-11-10</release_date>
            <images>
                <image>/sdssd/sdfsd.jpg</image>
                <image>/AL10sdfsds07XO/sdfsd.jpg</image>
            </images>
            <categories>
                <category code="166" video="0" parent="172">sd &amp; Sexy sdf</category>
                <category code="172" video="0" parent="">sd &amp; dddsdsds</category>
                <category code="641" video="0" parent="172">sdfsdf Costume sdfsdfsdf</category>
            </categories>
            <manufacturer code="AL" video="0">sdfsdf sdfs</manufacturer>
            <type code="LI" video="0">sdfsd</type>
        </product>
        <product active="1" on_sale="0" discountable="0">
            <sku>XXXXXXX</sku>
            <name><![CDATA[LEATHER sdfsdf (NET)]]></name>
            <description><![CDATA[asdgsdgsd sad sadg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg]]></description>
            <keywords></keywords>
            <price>5.000000</price>
            <stock_quantity>36</stock_quantity>
            <reorder_quantity>0</reorder_quantity>
            <height>0.000000</height>
            <length>0.000000</length>
            <diameter>0.000000</diameter>
            <weight>0.000000</weight>
            <color>Black</color>
            <material>Leather</material>
            <barcode>883045300164</barcode>
            <release_date>2008-11-10</release_date>
            <images>
                <image>/AL10sds0XO/sdsdsd.jpg</image>
                <image>/sdsds/AL1sd00XOB.jpg</image>
                <image>/AL1sdsds00XO/sdsds.jpg</image>
            </images>
            <categories>
                <category code="80" video="0" parent="44">sdgsdgsdg</category>
                <category code="181" video="0" parent="172">Sleep &amp; Lounge</category>
            </categories>
            <manufacturer code="AL" video="0">Allure sdsds</manufacturer>
            <type code="LI" video="0">sdsfsdfsd</type>
        </product>
    </products>

    <?php

    $xml_str = ob_get_contents();
    ob_end_clean();

    $tar_sku="SKUTARGET"; // this is the sku of the product block I need to have
    $pat= '/^.*(<product *<sku>'.$tar_sku.'</sku>*</product>).*$/is'; // this should match the block with the sku but no other block

    $replacement='$1';//This should overwrite everything with that found block.

    $returnValue = preg_replace($pat, $replacement, $xml_str);

Any help would be great. Thanks. Jeremy

[edit]

Here is the test code from the suggestion below. It doesn't work yet. I was expecting it to echo back the XML block whose sku matches, but no luck so far.

    <?php

    error_reporting(E_ALL);
    ini_set('display_errors', '1');
    umask(0);

    $xml_str = <<<EOD
            <?xml version="1.0" encoding="UTF-8"?>
            <products>
                <product active="1" on_sale="0" discountable="0">
                    <sku>SKUTARGET</sku>
                    <name><![CDATA[sdfsdf (NET)]]></name>
                    <description><![CDATA[agag adgsgsdg asdgsdg]]></description>
                    <keywords></keywords>
                    <price>9.000000</price>
                    <stock_quantity>35</stock_quantity>
                    <reorder_quantity>0</reorder_quantity>
                    <height>0.000000</height>
                    <length>0.000000</length>
                    <diameter>0.000000</diameter>
                    <weight>0.000000</weight>
                    <color>Black</color>
                    <material>PVC</material>
                    <barcode>883045010070</barcode>
                    <release_date>2008-11-10</release_date>
                    <images>
                        <image>/sdssd/sdfsd.jpg</image>
                        <image>/AL10sdfsds07XO/sdfsd.jpg</image>
                    </images>
                    <categories>
                        <category code="166" video="0" parent="172">sd &amp; Sexy sdf</category>
                        <category code="172" video="0" parent="">sd &amp; dddsdsds</category>
                        <category code="641" video="0" parent="172">sdfsdf Costume sdfsdfsdf</category>
                    </categories>
                    <manufacturer code="AL" video="0">sdfsdf sdfs</manufacturer>
                    <type code="LI" video="0">sdfsd</type>
                </product>
                <product active="1" on_sale="0" discountable="0">
                    <sku>XXXXXXX</sku>
                    <name><![CDATA[LEATHER sdfsdf (NET)]]></name>
                    <description><![CDATA[asdgsdgsd sad sadg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg]]></description>
                    <keywords></keywords>
                    <price>5.000000</price>
                    <stock_quantity>36</stock_quantity>
                    <reorder_quantity>0</reorder_quantity>
                    <height>0.000000</height>
                    <length>0.000000</length>
                    <diameter>0.000000</diameter>
                    <weight>0.000000</weight>
                    <color>Black</color>
                    <material>Leather</material>
                    <barcode>883045300164</barcode>
                    <release_date>2008-11-10</release_date>
                    <images>
                        <image>/AL10sds0XO/sdsdsd.jpg</image>
                        <image>/sdsds/AL1sd00XOB.jpg</image>
                        <image>/AL1sdsds00XO/sdsds.jpg</image>
                    </images>
                    <categories>
                        <category code="80" video="0" parent="44">sdgsdgsdg</category>
                        <category code="181" video="0" parent="172">Sleep &amp; Lounge</category>
                    </categories>
                    <manufacturer code="AL" video="0">Allure sdsds</manufacturer>
                    <type code="LI" video="0">sdsfsdfsd</type>
                </product>
            </products>


    EOD;


    $tar_sku="SKUTARGET"; // this is the sku of the product block I need to have
    $pattern = "~<product .*?<sku>$tar_sku</sku>.*?</product>~is"; 
    $returnValue = preg_match($pattern,$xml_str);

    echo '--'.$returnValue[0];

Upvotes: 1

Views: 1523

Answers (4)

Francis Avila

Reputation: 31621

Do not use a regex to parse XML. If your concern is memory usage, using a regex will consume much more memory than incremental parsing. Since a regex can only operate on a string, you will need at least 100MB of memory just to hold the file string before you can do anything with it. If you use an incremental XML parser, you can use less memory than the size of the file.

The right tool for this job is XMLReader.

tl;dr

There are two XMLReader parsing implementations in this answer:

  • The getmatchingproducts_xml_expand() and getmatchingproducts_xml_noexpand() functions return a list of all matched products. Memory usage depends on how many products with a matching SKU are in the source XML.
  • The ProductMatcher class is an Iterator (it can be used in foreach) that returns matched products incrementally as either a string, a DOMDocument, or a SimpleXMLElement. It uses about 1MB of memory no matter how big your source XML is or how many products match.

Test File

I created a 120 MB sample file using the format you created. This is the creation code:

function maketestfile() {
    $xml = <<<EOT
        <product active="1" on_sale="0" discountable="0">
            <sku>{{SKU}}</sku>
            <name><![CDATA[LEATHER sdfsdf (NET)]]></name>
            <description><![CDATA[asdgsdgsd sad sadg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg asdg]]></description>
            <keywords></keywords>
            <price>5.000000</price>
            <stock_quantity>36</stock_quantity>
            <reorder_quantity>0</reorder_quantity>
            <height>0.000000</height>
            <length>0.000000</length>
            <diameter>0.000000</diameter>
            <weight>0.000000</weight>
            <color>Black</color>
            <material>Leather</material>
            <barcode>883045300164</barcode>
            <release_date>2008-11-10</release_date>
            <images>
                <image>/AL10sds0XO/sdsdsd.jpg</image>
                <image>/sdsds/AL1sd00XOB.jpg</image>
                <image>/AL1sdsds00XO/sdsds.jpg</image>
            </images>
            <categories>
                <category code="80" video="0" parent="44">sdgsdgsdg</category>
                <category code="181" video="0" parent="172">Sleep &amp; Lounge</category>
            </categories>
            <manufacturer code="AL" video="0">Allure sdsds</manufacturer>
            <type code="LI" video="0">sdsfsdfsd</type>
        </product>
EOT;
    $fo = fopen('test2.xml', 'wb');
    fwrite($fo, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    fwrite($fo, "<products>\n");
    $sku = array('SKUTARGET', 'XXXXXXXX', 'SKUY12345', '124432XXK', 'FOO1234BAR');
    for ($i=0; $i < 100000; $i++) { 
        shuffle($sku);
        fwrite($fo, str_replace('{{SKU}}', $sku[0], $xml));
    }
    fwrite($fo, "</products>\n");
    fclose($fo);
}

Memory and Timing Function

function trial($method, $args) {
    //prime the pump
    if (!function_exists($method))
        throw new BadFunctionCallException("No such function: $method");

    call_user_func_array($method, $args);
    $iter = 2;
    $runtime = 0;
    for ($i=0; $i < $iter; $i++) {
        $start = microtime(true);
        $res = call_user_func_array($method, $args);
        $runtime += microtime(true)-$start;
    }
    return array(
        'peakmem' => memory_get_peak_usage(),
        'mem' => memory_get_usage(),
        'time' => $runtime/$iter,
        'return' => $res,
    );
}
function main($method, $filename) {
    $args = array($filename, 'SKUTARGET');
    $res = trial($method, $args);
    echo "Found products: ",count($res['return']),"\n";
    printf("%30s %3.2f %3.2f %4.3f\n", $method, $res['peakmem']/(1024*1024), $res['mem']/(1024*1024), $res['time']);
}

main($argv[1], $argv[2]);

Regex vs XMLReader

Finally I tested these functions. The first two use the regexes suggested by other answers, and the third one uses XMLReader.

function getmatchingproducts_regex1($xmlfile, $desiredsku) {
    $pattern = "~<product [^<]*<sku>".preg_quote($desiredsku,'~')."</sku>.*?</product>~Sus";
    $xmlstr = file_get_contents($xmlfile);
    preg_match_all($pattern, $xmlstr, $matchingproducts);
    // preg_match_all() fills a nested array; index 0 holds the full matches
    return $matchingproducts[0];
}

function getmatchingproducts_regex2($xmlfile, $desiredsku) {
    $pattern = "~<product [^<]*+<sku>".preg_quote($desiredsku,'~')."</sku>[^<]*(?:<(?!/product>)[^<]*)*</product>~Su";
    $xmlstr = file_get_contents($xmlfile);
    preg_match_all($pattern, $xmlstr, $matchingproducts);
    // preg_match_all() fills a nested array; index 0 holds the full matches
    return $matchingproducts[0];
}

function getmatchingproducts_xml_expand($xmlfile, $desiredsku) {
    $r = new XMLReader();
    $r->open($xmlfile, null, LIBXML_COMPACT);
    $matchingproducts = array();
    do {
        // advance to first product element
        $r->read();
    } while ($r->nodeType!==XMLReader::NONE
        and !($r->nodeType===XMLReader::ELEMENT and $r->name==='product' and !$r->isEmptyElement));

    while ($r->nodeType!==XMLReader::NONE) {
        if ($r->nodeType===XMLReader::ELEMENT and $r->name==='product' and !$r->isEmptyElement) {
            $dom = $r->expand(new DOMDocument('1.0','UTF-8'));
            $sxe = simplexml_import_dom($dom);
            if ((string) $sxe->sku===$desiredsku) {
                // Matching product found.
                // We have access to the <product> element and contents as:
                // * raw text via $r->readOuterXml()
                // * DOMDocument via $dom
                // * SimpleXML via $sxe
                // Pick the one you want and save:
                $matchingproducts[] = $r->readOuterXml();
                // null the rest to be very conservative about memory
                $dom = $sxe = null;
            }
        }
        // optimization--skip to next product sibling
        $r->next('product');
    }
    $r->close();
    return $matchingproducts;
}

Finally I saved all this in a file and ran it on my dual-core, 8GB system. (Numbers are peak memory, final memory, and seconds per iteration. "Found Products" is just to verify the correct number of products matched.)

$ php xmlreader.php getmatchingproducts_xml_expand test2.xml
Found products: 19969
getmatchingproducts_xml_expand 86.96 58.17 10.648
$ php xmlreader.php getmatchingproducts_regex1 test2.xml
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 1253 bytes) in xmlreader.php on line 72
$ php xmlreader.php getmatchingproducts_regex2 test2.xml
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 1253 bytes) in xmlreader.php on line 78

You'll notice that the regex methods could not even run without exhausting available memory! Further, the XMLReader method (in addition to actually parsing XML correctly) used less memory than the size of the file. I'm willing to bet money that most of the getmatchingproducts_xml_expand memory is the $matchingproducts array, too, and not from parsing. You can cut down memory usage even further by wrapping the parser function in a class so you can retrieve one match at a time.

The advantage of using a Regex, though, is that it's much faster. Here's another try, raising the memory limit to 1GB:

$ php -d memory_limit=1G xmlreader.php getmatchingproducts_regex1 test2.xml
Found products: 19968
    getmatchingproducts_regex1 181.31 30.01 1.421
$ php -d memory_limit=1G xmlreader.php getmatchingproducts_regex2 test2.xml
Found products: 19968
    getmatchingproducts_regex2 181.31 30.01 0.906

All of that speed comes from ignoring the rules of XML parsing and treating it as a string. (Interestingly, the fact that the whole file is in memory doesn't affect XMLReader's speed, only its memory usage.)

If you need fast access and low memory usage, you need some kind of indexing or database. You can create a flat-file db using sqlite, sqlite3, dbm and load it with products keyed by SKU using XMLReader. Then instead of reading the XML file, load the xml string for that product from the db.
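
The indexing idea can be sketched roughly as follows. This is only an illustration under assumptions: it needs the pdo_sqlite extension, the helper names build_sku_index() and lookup_sku() and the table layout are made up for this example, and the usage part runs against a tiny throwaway file and an in-memory database rather than a real on-disk index.

```php
<?php
// Hypothetical helpers; assumes the pdo_sqlite and xmlreader extensions.
function build_sku_index($xmlfile, PDO $db)
{
    // SKUs may repeat in the source file, so sku is indexed but not unique.
    $db->exec('CREATE TABLE IF NOT EXISTS product (sku TEXT NOT NULL, xml TEXT NOT NULL)');
    $db->exec('CREATE INDEX IF NOT EXISTS product_sku ON product (sku)');
    $insert = $db->prepare('INSERT INTO product (sku, xml) VALUES (?, ?)');

    $r = new XMLReader();
    $r->open($xmlfile);
    // Advance to the first <product> element.
    while ($r->read() && !($r->nodeType === XMLReader::ELEMENT && $r->name === 'product'));

    $db->beginTransaction(); // a single transaction keeps bulk inserts fast
    while ($r->nodeType === XMLReader::ELEMENT && $r->name === 'product') {
        $node = $r->expand();
        $sku = $node ? $node->getElementsByTagName('sku')->item(0) : null;
        if ($sku !== null) {
            $insert->execute(array($sku->textContent, $r->readOuterXml()));
        }
        if (!$r->next('product')) {
            break;
        }
    }
    $db->commit();
    $r->close();
}

function lookup_sku(PDO $db, $sku)
{
    $q = $db->prepare('SELECT xml FROM product WHERE sku = ? ORDER BY rowid');
    $q->execute(array($sku));
    return $q->fetchAll(PDO::FETCH_COLUMN); // list of XML strings
}

// Usage with hypothetical sample data.
$file = tempnam(sys_get_temp_dir(), 'sku');
file_put_contents($file, '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
    . '<products>'
    . '<product active="1"><sku>SKUTARGET</sku><color>Black</color></product>'
    . '<product active="1"><sku>XXXXXXX</sku><color>Black</color></product>'
    . '<product active="0"><sku>SKUTARGET</sku><color>Red</color></product>'
    . '</products>');

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
build_sku_index($file, $db);
unlink($file);

$blocks = lookup_sku($db, 'SKUTARGET');
echo count($blocks), "\n"; // 2
```

With a real file you would point the PDO DSN at a database file and rebuild the index only when the XML changes; every lookup after that is an indexed query instead of a full parse.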

Just for kicks, I tried an XMLReader parsing method that didn't use expansion, to see if I could save time or memory. The memory difference was negligible, though; it was actually a few seconds slower, and the code is much less clear.

function getmatchingproducts_xml_noexpand($xmlfile, $desiredsku) {
    $r = new XMLReader();
    $r->open($xmlfile, null, LIBXML_COMPACT);
    $matchingproducts = array();
    $candidateproduct = null;
    do {
        // advance to first product element
        $r->read();
    } while ($r->nodeType!==XMLReader::NONE
        and !($r->nodeType===XMLReader::ELEMENT and $r->name==='product' and !$r->isEmptyElement));
    while ($r->nodeType!==XMLReader::NONE) {
        if ($r->nodeType===XMLReader::ELEMENT and $r->name==='product' and !$r->isEmptyElement) {
            $candidateproduct = array($r->readOuterXML(), $r->depth);
            $r->read();
            while ($r->depth > $candidateproduct[1]) {
                if ($r->nodeType===XMLReader::ELEMENT and $r->name==='sku' and $r->readString()===$desiredsku) {
                    $matchingproducts[] = $candidateproduct[0];
                    $r->next('product');
                    break;
                } else {
                    $r->next();
                }
            }
            $candidateproduct = null;
        } else {
            $r->next();
        }
    }
    $r->close();
    return $matchingproducts;
}



$ php xmlreader.php getmatchingproducts_xml_noexpand test2.xml
Found products: 19969
getmatchingproducts_xml_noexpand 86.95 58.17 13.716

Returning Parse Results Incrementally

Yet another implementation. This is probably as efficient as this can get. It parses the 120MB test file using less than 1MB of memory.

class ProductMatcher implements Iterator {
    // return values for next()
    const R_STR = 'product_str'; // return string
    const R_DOM = 'product_dom'; // return DOMDocument
    const R_SXE = 'product_sxe'; // return SimpleXMLElement

    protected $reader;
    protected $productcount = null;
    protected $product_str = null;
    protected $product_dom = null;
    protected $product_sxe = null;
    protected $xmlfile;
    protected $returnmethod;
    public $desiredsku;

    function __construct($xmlfile, $desiredsku, $returnmethod=self::R_STR) {
        $this->xmlfile = $xmlfile;
        $this->desiredsku = $desiredsku;
        $this->setReturnMethod($returnmethod);
    }
    function __destruct() {
        if (isset($this->reader)) {
            $this->reader->close();
        }
    }
    protected function _create() {
        $this->productcount = null;
        $this->reader = new XMLReader();
        $this->reader->open($this->xmlfile, null, LIBXML_COMPACT);
    }
    protected function _start() {
        $r =& $this->reader;
        do {
            // advance to first product element
            $r->read();
        } while ($r->nodeType!==XMLReader::NONE
            and !($r->nodeType===XMLReader::ELEMENT and $r->name==='product' and !$r->isEmptyElement));
    }
    protected function advance() {
        $r =& $this->reader;
        $productfound = false;
        $this->product_str = $this->product_sxe = $this->product_dom = null;
        while ($r->nodeType!==XMLReader::NONE and !$productfound) {
            if ($r->nodeType===XMLReader::ELEMENT and $r->name==='product' and !$r->isEmptyElement) {
                // xmlreader_print($r);
                $dom = $r->expand(new DOMDocument('1.0','UTF-8'));
                $sxe = simplexml_import_dom($dom);
                if ((string) $sxe->sku===$this->desiredsku) {
                    $this->product_str = $r->readOuterXml();
                    $this->product_sxe = $sxe;
                    $this->product_dom = $dom;
                    $productfound = true;
                    $this->productcount = (isset($this->productcount)) ? $this->productcount+1 : 0;
                }
            }
            // optimization--skip to next product sibling
            $r->next('product');
        }
        if (!$productfound) {
            $this->productcount = null;
        }
    }
    public function setReturnMethod($method) {
        $this->returnmethod = $method;
    }
    public function getReturnMethod() {
        return $this->returnmethod;
    }
    public function rewind() {
        $this->_create();
        $this->_start();
        $this->advance();
    }
    public function valid() {
        return $this->productcount!==null;
    }
    public function current() {
        return $this->{$this->returnmethod};
    }
    public function key() {
        return $this->productcount;
    }
    public function next() {
        $this->advance();
    }
}

function timeProductMatcher($filename) {
    $matcher = new ProductMatcher($filename, 'SKUTARGET');
    foreach ($matcher as $m) {}
    $runtime = 0;
    $iter = 2;
    for ($i=0; $i < $iter; $i++) { 
        $start = microtime(true);
        $matcher = new ProductMatcher($filename, 'SKUTARGET');
        foreach ($matcher as $n => $match) {}
        $runtime += microtime(true)-$start;
    }
    echo "Found products: ",$n+1, "\n";
    printf("%30s %3.2f %3.2f %4.3f\n", 'ProductMatcher', memory_get_peak_usage()/(1024*1024), memory_get_usage()/(1024*1024), $runtime/$iter);
}
timeProductMatcher($argv[1]);

Results:

$ php xmlreader.php test2.xml
Found products: 19969
                ProductMatcher 0.76 0.75 10.394

Expanded example usage:

$matcher = new ProductMatcher($filename, 'SKUTARGET', ProductMatcher::R_SXE);
foreach ($matcher as $product) {
    // $product is a SimpleXMLElement because we specified R_SXE
    (string) $product->sku === 'SKUTARGET'; // true
}

Upvotes: 2

Alan Moore

Reputation: 75222

The main problem with your code is that you're using preg_match incorrectly. The return value is just an integer representing the number of times the regex matched, i.e., 0 or 1 (or false if an error occurred). If you want to retrieve the matched text, you have to supply an array to store it in:

preg_match($pattern, $subject, $matches);
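
As a minimal sketch of the difference, here is the asker's pattern run against a small, made-up two-product string (the sample data is an assumption for illustration only):

```php
<?php
// Hypothetical sample data: two tiny <product> blocks.
$xml_str = '<products>'
    . '<product active="1"><sku>SKUTARGET</sku><price>9.000000</price></product>'
    . '<product active="1"><sku>XXXXXXX</sku><price>5.000000</price></product>'
    . '</products>';

$tar_sku = 'SKUTARGET';
$pattern = "~<product .*?<sku>$tar_sku</sku>.*?</product>~is";

// The return value only reports whether a match was found.
$found = preg_match($pattern, $xml_str, $matches);
var_dump($found); // int(1)

// The matched text itself lands in $matches[0].
$block = ($found === 1) ? $matches[0] : null;
echo $block, "\n";
```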

But there's a problem with the regex too, which you'll see if you make the second <product> element your target instead of the first one. The match still starts with the first <product> element, then continues to the end of the second one. The reluctant .*? is not sufficient to guarantee the shortest possible match, because it only affects where the match ends, not where it starts.

You need to make sure that, after it matches the opening <product> tag, it can't match any more <product> or </product> tags before it finds the <sku> tag. Assuming <sku> is always the first element listed inside the <product> element, that's a simple matter of changing the first .*? to [^<]*:

"~<product [^<]*<sku>$tar_sku</sku>.*?</product>~is"
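
To see the overmatch concretely, the sketch below runs both patterns against a made-up string in which the target SKU sits in the second product (the sample data and variable names are assumptions for illustration):

```php
<?php
// Hypothetical sample: the target SKU is in the SECOND product.
$xml_str = '<products>'
    . '<product active="1"><sku>XXXXXXX</sku><price>5.000000</price></product>'
    . '<product active="1"><sku>SKUTARGET</sku><price>9.000000</price></product>'
    . '</products>';
$tar_sku = 'SKUTARGET';

$lazy   = "~<product .*?<sku>$tar_sku</sku>.*?</product>~is";
$strict = "~<product [^<]*<sku>$tar_sku</sku>.*?</product>~is";

preg_match($lazy, $xml_str, $m_lazy);
preg_match($strict, $xml_str, $m_strict);

// The lazy pattern starts at the first <product> and runs through the second.
var_dump(substr_count($m_lazy[0], '<product '));   // int(2)
// With [^<]* the match can only start at the product whose first child is the SKU.
var_dump(substr_count($m_strict[0], '<product ')); // int(1)
```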

Furthermore, given the size of the files, it might be worth the effort to make the regex as efficient as possible. To that end, I would make that first quantifier possessive - [^<]*+ - and replace the other .*? with something more deterministic.

"~<product [^<]*+<sku>$tar_sku</sku>[^<]*(?:<(?!/product>)[^<]*)*</product>~"

Notice that I also removed the modifiers; the s flag is irrelevant now since there are no dots in the regex, and the i flag probably never was needed, given that XML tag names are case sensitive. If the SKU isn't, you can apply the i flag to just that part of the regex with an inline modifier, i.e., (?i:...):

"~<product [^<]*+<sku>(?i:$tar_sku)</sku>[^<]*(?:<(?!/product>)[^<]*)*</product>~"
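
One caveat when interpolating the SKU: if it can contain regex metacharacters such as a dot, it is safer to run it through preg_quote() first. The sketch below wraps the pattern in a helper; the name build_product_pattern() and the sample SKUs are made up for illustration.

```php
<?php
// Hypothetical helper, not part of the original answer.
function build_product_pattern($sku, $caseInsensitiveSku = false)
{
    // Escape metacharacters and the ~ delimiter inside the SKU.
    $quoted = preg_quote($sku, '~');
    $skuPart = $caseInsensitiveSku ? "(?i:$quoted)" : $quoted;
    return "~<product [^<]*+<sku>{$skuPart}</sku>[^<]*(?:<(?!/product>)[^<]*)*</product>~";
}

// Hypothetical sample data with look-alike SKUs.
$xml_str = '<products>'
    . '<product active="1"><sku>AB.12</sku><name><![CDATA[first]]></name></product>'
    . '<product active="1"><sku>ab.12</sku><name><![CDATA[second]]></name></product>'
    . '<product active="1"><sku>ABX12</sku><name><![CDATA[third]]></name></product>'
    . '</products>';

preg_match_all(build_product_pattern('AB.12'), $xml_str, $exact);
preg_match_all(build_product_pattern('AB.12', true), $xml_str, $loose);

echo count($exact[0]), "\n"; // 1 (ABX12 is not matched because the dot is escaped)
echo count($loose[0]), "\n"; // 2 (AB.12 and ab.12)
```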

Here's a demo: http://ideone.com/mZqFz

Upvotes: 0

Rob Apodaca

Reputation: 834

Do not use regex to parse xml. Instead, use the family of xml tools for php. It's not clear exactly what you want to do here. It looks like you want to pick out a specific element from the document and replace it with something else. Here is an example.

[edit] Ok, so if you want to process large xml files, use xmlreader.
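
A minimal sketch of that approach, assuming only one product block is wanted and that reading can stop at the first hit. The helper name find_first_product() and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: returns the outer XML of the first <product> whose
// <sku> equals $sku, or null if there is none.
function find_first_product($xmlfile, $sku)
{
    $r = new XMLReader();
    if (!$r->open($xmlfile)) {
        return null;
    }
    // Advance to the first <product> element.
    while ($r->read() && !($r->nodeType === XMLReader::ELEMENT && $r->name === 'product'));

    $result = null;
    while ($r->nodeType === XMLReader::ELEMENT && $r->name === 'product') {
        // Only the current <product> subtree is expanded into a DOM node.
        $node = $r->expand();
        $skuNode = $node ? $node->getElementsByTagName('sku')->item(0) : null;
        if ($skuNode !== null && $skuNode->textContent === $sku) {
            $result = $r->readOuterXml();
            break; // stop here: the rest of the file is never read
        }
        if (!$r->next('product')) {
            break;
        }
    }
    $r->close();
    return $result;
}

// Usage with a small throwaway file (hypothetical data).
$file = tempnam(sys_get_temp_dir(), 'sku');
file_put_contents($file, '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
    . '<products>'
    . '<product active="1"><sku>XXXXXXX</sku><color>Black</color></product>'
    . '<product active="1"><sku>SKUTARGET</sku><color>Black</color></product>'
    . '</products>');
$block = find_first_product($file, 'SKUTARGET');
unlink($file);
echo $block, "\n";
```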

Upvotes: 0

mathematical.coffee

Reputation: 56905

If you only want to extract the matching <product>..</product> bit, use preg_match, not preg_replace.

$pattern = "~<product .*?<sku>$tar_sku</sku>.*?</product>~is";
preg_match($pattern, $xml_str, $returnValues);

Here $returnValues is an array that is either empty or has one element, $returnValues[0], containing the relevant bit of XML you're after. (The return value of preg_match itself is only the number of matches, 0 or 1.)

This is because preg_match stops at the first match. If you know that there'll only be one corresponding SKUTARGET in the whole XML, use preg_match. If you think there could be more than one and want to extract all of them, use preg_match_all.
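
A short sketch of the two calls side by side, on made-up data with two products sharing the target SKU. One assumption to flag: the first .*? is tightened to [^<]* here (which presumes <sku> is the first child of <product>), so that a match cannot begin in an earlier, non-matching product.

```php
<?php
// Hypothetical sample data: two products share the target SKU.
$xml_str = '<products>'
    . '<product active="1"><sku>SKUTARGET</sku><color>Black</color></product>'
    . '<product active="1"><sku>XXXXXXX</sku><color>Black</color></product>'
    . '<product active="0"><sku>SKUTARGET</sku><color>Red</color></product>'
    . '</products>';
$tar_sku = 'SKUTARGET';
// Assumption: [^<]* instead of the first .*? (see the note above).
$pattern = "~<product [^<]*<sku>$tar_sku</sku>.*?</product>~is";

// preg_match stops after the first hit; $first[0] is that block.
preg_match($pattern, $xml_str, $first);

// preg_match_all keeps going; $all[0] is the list of every matching block.
$count = preg_match_all($pattern, $xml_str, $all);

echo $count, "\n"; // 2
echo $first[0] === $all[0][0] ? "same first block\n" : "different\n"; // same first block
```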

The regex is basically the same as yours, except:

  • you can't use / to delimit your regex (like /regex/) if there are internal / within the regex, unless you escape them. So you'd have to escape the / in </sku> etc. I've changed the delimiter to ~ so I don't need to bother escaping my internal /.
  • <product * -> <product .* (the former only matches spaces)
  • </sku>*</product> -> </sku>.*?</product> (the former only matches 0 or more > after the /sku).
  • changed greedy .* to non-greedy .*? to prevent grabbing XML belonging to the next product, for example.

Upvotes: 0
