someone

Reputation:

How to extract data from CDATA

I am extracting data from an XML document, and some tags wrap their data in CDATA like this:

<description><![CDATA[Changes (as compared to 8.17) include:
Features:
    * Added a &#8216;Schema Optimizer&#8217; feature. Based on &#8220;procedure analyse()&#8221; it will propose alterations to data types for a table based on analysis on what data are stored in the table. The feature is available from INFO tab/HTML mode.  Refer to documentation for details.
    * A table can now be added [...]]]>
</description>

I am already using preg_match to extract data from the description tag. So how can I extract the data from the CDATA section?

Upvotes: 1

Views: 2537

Answers (3)

Mubashar

Reputation: 12658

@Pavel Minaev is right: keep regular expressions as a last resort, and always use an XML parser for XML. Parsers are available in almost every language now. For example, I usually use DOMDocument to parse or create XML in PHP. It is simple and easy to understand, especially for people like me who only use PHP occasionally.

For example, say you want to extract the CDATA from the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE message SYSTEM "https://www.abcd.com/dtds/AbcdefMessageXmlApi.dtd">
<message id="9002">
  <report>
    <![CDATA[id:50121515075540159 sub:001 text text text text text]]>
  </report>
  <number>353874181931</number>
</message>

Use the following code to extract the CDATA:

// $xml_response holds the XML string; $log is assumed to be an existing logger object.
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;

if (!$doc->loadXML($xml_response)) {
    // log the error and/or throw an exception, then stop here
}

$response_element = $doc->documentElement;

if ($response_element->tagName == "message") {
    // getElementsByTagName() always returns a DOMNodeList, so checking its length is enough
    $report_node = $response_element->getElementsByTagName("report");

    if ($report_node->length == 1) {
        // textContent returns the element's text, including the CDATA content
        $narrative = $report_node->item(0)->textContent;

        $log->debug("CDATA: $narrative");
    } else {
        $log->error("unable to find report tag or multiple report tags found in response xml");
    }
} else {
    $log->error("unexpected root tag (" . $response_element->tagName . ") in response xml");
}

After this runs, the $narrative variable should hold all of the text, and don't worry, it will not contain the ugly CDATA wrapper.
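For anyone who wants to try this without a logger or an API response at hand, here is a minimal self-contained sketch of the same idea. The helper name extractElementText() and the sample XML are made up for illustration, they are not from the question or from any library. It assumes you only care about the first matching element.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the text of
// the first <$tagName> element, or null if the XML is invalid or the tag is missing.
function extractElementText(string $xml, string $tagName): ?string
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $nodes = $doc->getElementsByTagName($tagName);
    if ($nodes->length === 0) {
        return null;
    }

    // textContent includes the content of CDATA sections, without the wrapper.
    return trim($nodes->item(0)->textContent);
}

// Made-up sample input, loosely modelled on the question's <description> tag.
$xml = '<item><description><![CDATA[Features: <b>Schema Optimizer</b> & more]]></description></item>';
$text = extractElementText($xml, 'description');

echo $text, PHP_EOL; // prints: Features: <b>Schema Optimizer</b> & more
```

Note that the markup inside the CDATA section comes back as plain text, so the <b> tags are part of the string, not child elements.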

Happy coding :)

Upvotes: 0

RageZ

Reputation: 27313

You should use SimpleXML and XPath if you need to extract a complex set of data.

<?php
$string = <<<XML
<?xml version='1.0'?> 
<document>
 <title>Forty What?</title>
 <from>Joe</from>
 <to>Jane</to>
 <body>
  I know that's the answer -- but what's the question?
 </body>
</document>
XML;

$xml = simplexml_load_string($string);

print_r($xml);
?>

This would produce output like this:

SimpleXMLElement Object
(
  [title] => Forty What?
  [from] => Joe
  [to] => Jane
  [body] =>
   I know that's the answer -- but what's the question?
)

So in your case you would just navigate inside your document, which is much easier than regular expressions, isn't it?
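Since the question is about CDATA specifically, here is a rough sketch of how that could look with SimpleXML and XPath. The <channel>/<item> structure is an assumption made up for the example, so adjust the paths to your real document. Casting an element to a string gives you its text, CDATA content included. Be aware that print_r() and var_dump() may not display CDATA content unless you load the document with the LIBXML_NOCDATA option, so don't be misled by an empty-looking dump.

```php
<?php
// Made-up sample document; the element names are assumptions for illustration.
$string = <<<XML
<?xml version="1.0"?>
<channel>
  <item>
    <title>First</title>
    <description><![CDATA[Added a <b>Schema Optimizer</b> feature.]]></description>
  </item>
  <item>
    <title>Second</title>
    <description><![CDATA[A table can now be added.]]></description>
  </item>
</channel>
XML;

$xml = simplexml_load_string($string);
if ($xml === false) {
    exit("Could not parse XML\n");
}

// Casting an element to string returns its text, CDATA content included.
$first = (string) $xml->item[0]->description;

// xpath() returns an array of SimpleXMLElement objects; convert each to a string.
$descriptions = array_map('strval', $xml->xpath('//item/description'));

echo $first, PHP_EOL;
print_r($descriptions);
```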

Upvotes: 0

Pavel Minaev

Reputation: 101565

Regardless of the language, don't use regular expressions to parse XML - you will almost certainly get it wrong. Use an XML parser.
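To illustrate one way a regular expression can go wrong, here is a small PHP comparison (PHP because that is what the question uses). The sample XML and the pattern are made up for the example. The regex hands back the raw markup, CDATA wrapper and undecoded entity included, while the parser returns the actual text.

```php
<?php
// Made-up sample: a CDATA section followed by ordinary text with an entity.
$xml = '<item><description><![CDATA[Tom & Jerry]]> &amp; friends</description></item>';

// Naive regex: captures whatever markup sits between the tags.
preg_match('/<description>(.*?)<\/description>/s', $xml, $m);
$viaRegex = $m[1];

// XML parser: unwraps the CDATA section and decodes the entity.
$doc = new DOMDocument();
$doc->loadXML($xml);
$viaParser = $doc->getElementsByTagName('description')->item(0)->textContent;

echo $viaRegex, PHP_EOL;  // prints: <![CDATA[Tom & Jerry]]> &amp; friends
echo $viaParser, PHP_EOL; // prints: Tom & Jerry & friends
```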

Upvotes: 7
