bln_dev

Reputation: 2421

How to extract html markup within an XML node with XPath

I'm using DOMDocument and XPath.

Given the following XML:

<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>

The node /Description/CompleteText/DetailTxt/Text contains HTML markup that is, unfortunately, not escaped, and I can't change that. Is there a way to query that content while keeping the HTML markup?

What I tried

I tried nodeValue, obviously, and also textContent. Both give me the text content with the markup stripped.
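
For reference, a minimal reproduction of what I see. This is only a sketch: the XML is a shortened stand-in for the document above, and the variable names are mine.

```php
<?php
// Reduced version of the XML above, just to reproduce the problem.
$doc = new DOMDocument();
$doc->loadXML('<Text><span>Here there is some text</span><br/><span>Normal position</span></Text>');

$xpath = new DOMXPath($doc);
$node  = $xpath->query('/Text')->item(0);

// Both properties return only the concatenated text of the descendants;
// the <span> and <br/> tags are gone.
$fromTextContent = $node->textContent;
$fromNodeValue   = $node->nodeValue;

echo $fromTextContent, "\n";
echo $fromNodeValue, "\n";
```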

Upvotes: 1

Views: 241

Answers (2)

bln_dev

Reputation: 2421

I got a good result using the C14N method of DOMNode.

http://sandbox.onlinephpfunctions.com/code/90dc915c9a43c91d31fcd47d37e89df430951b2e

<?php
$xml = <<<'EOD'
<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>  
EOD;

$doc = new DOMDocument();

$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Select the <Text> element; query() returns a DOMNodeList.
$domNodes = $xpath->query('/Description/CompleteText/DetailTxt/Text');
$domNode = $domNodes->item(0);

// C14N() returns the canonical XML serialization of the node,
// including the <Text> element itself.
$innerHTML = $domNode->C14N();

echo $innerHTML;

Result

<Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br></br>
                <span>Normal position</span>
                <br></br>
                <span> </span>
                <br></br>
            </Text>

This seems shorter in a way; what do you think? I would still need to get rid of the enclosing <Text> node, though. Thanks also for pointing me to PHP Sandbox.

Update

I realize that C14N() changes the markup: <br/> becomes <br></br>.
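
One way that might address both points (dropping the enclosing <Text> element and keeping <br/> as written) is to serialize each child node with DOMDocument::saveXML instead. A minimal sketch, assuming a shortened version of the XML above; the helper name innerXML is my own and not part of the DOM API.

```php
<?php
// Hypothetical helper: concatenates the XML serialization of every child
// of $node, so the wrapping element itself is not included.
function innerXML(DOMNode $node): string {
    $xml = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes only that node; empty elements stay self-closed.
        $xml .= $node->ownerDocument->saveXML($child);
    }
    return $xml;
}

$doc = new DOMDocument();
$doc->loadXML('<Text><span>Here there is some text</span><br/><span>Normal position</span></Text>');

$xpath = new DOMXPath($doc);
$text  = $xpath->query('/Text')->item(0);

$inner = innerXML($text);
echo $inner;
```

Unlike C14N(), this does not canonicalize the markup, so it should stay closer to the original source.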

Upvotes: 0

Martin Honnen

Reputation: 167716

You can use the saveHTML method of DOMDocument to serialize a node as HTML. In your case you seem to want to call it on each child node of the selected node and concatenate the resulting strings. In the browser DOM APIs that would be called innerHTML, so I have written a function of that name that does this. The following snippet also uses the ability to call PHP functions from XPath:

<?php
$xml = <<<'EOD'
<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>  
EOD;

$doc = new DOMDocument();

$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

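// Concatenates the HTML serialization of each child of the first selected node.
// When called through php:function(), a node-set argument arrives as an array of DOM nodes.
function innerHTML($nodeList) {
  $node = $nodeList[0];
  $html = '';
  $containingDoc = $node->ownerDocument;
  foreach ($node->childNodes as $child) {
    $html .= $containingDoc->saveHTML($child);
  }
  return $html;
}

// Make the php: prefix available and expose innerHTML to XPath expressions.
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("innerHTML");

$innerHTML = $xpath->evaluate('php:function("innerHTML", /Description/CompleteText/DetailTxt/Text)');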

echo $innerHTML;

The output, as shown at http://sandbox.onlinephpfunctions.com/code/62a980e2d2a2485c2648e16fc647a6bd6ff5620b, is:

            <span>Here there is some text</span>
            <h2>And maybe a headline</h2>
            <br>
            <span>Normal position</span>
            <br>
            <span> </span>
            <br>
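
If calling PHP from XPath is not needed, the same idea could also be used from plain PHP after a regular query. A minimal sketch, assuming a shortened version of the XML above; the helper name childrenAsHtml is hypothetical, and the length check is there because query() can return an empty list.

```php
<?php
// Hypothetical helper: same approach as innerHTML above, but it takes
// a single DOMNode instead of the array passed in by php:function().
function childrenAsHtml(DOMNode $node): string {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadXML('<Text><span>Here there is some text</span><br/><span>Normal position</span></Text>');

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('/Text');

// Guard against an empty result before touching the first node.
$result = $nodes->length > 0 ? childrenAsHtml($nodes->item(0)) : '';

echo $result;
```

Note that saveHTML serializes with HTML rules, which is why <br/> comes out as <br> in the output above.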

Upvotes: 1
