Reputation: 68492

XPath - select empty elements that are not part of a list

$list = array('br', 'hr', 'link', 'meta', 'title');

Using DOMXPath, how can I select elements that are empty and whose tagName is not in $list? (I want to add a space to their textContent so that they are not automatically self-closed.)

Upvotes: 2

Answers (5)

Dimitre Novatchev

Reputation: 243449

Here is a single, one-liner XPath expression that selects the wanted nodes:

//*[not(node()[not(self::text())]) 
  and not(normalize-space())
  and contains('|br|hr|link|meta|title|', concat('|', name(), '|'))
   ]

This selects any element in the XML document whose only children, if any, are text nodes, whose normalized string value (all leading and trailing white-space characters deleted and each run of intermediate white-space characters replaced by a single space) is the empty string, and whose name is one of br, hr, link, meta or title.

XSLT - based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>


  <xsl:template match="/">
   <xsl:copy-of select=
   "//*[not(node()[not(self::text())])
      and not(normalize-space())
      and contains('|br|hr|link|meta|title|', concat('|', name(), '|'))
       ]
   "/>
  </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the following XML document:

<html lang='en'>
    <head>
        <meta charset='utf-8'/>
        <title></title>
        <link rel='stylesheet' href='/assets/index.css'/>
    </head>
    <body>
        <div>
            <header>
                <h1></h1>
            </header>
            <section>
                <article></article>
                <aside></aside>
            </section>
            <br />
            <footer>
                <small>
                 Copyright &#169;
                    <span></span>
                </small>
            </footer>
        </div>
        <script src='//code.jquery.com/jquery-latest.min.js'></script>
        <script src='/assets/index.js'></script>
    </body>
</html>

the XPath expression is evaluated and the (correctly) selected nodes are copied to the output:

<meta charset="utf-8"/>
<title/>
<link rel="stylesheet" href="/assets/index.css"/>
<br/>
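
Note that the expression above keeps the elements whose name is in the list. For the question as asked (names not in the list), the contains() test can be wrapped in not(). Below is a minimal PHP sketch of that variant. The sample markup and the variable names $names, $query and $selected are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sketch: the same expression with the name test negated, so that it selects
// empty elements whose name is NOT in $list.
$list = array('br', 'hr', 'link', 'meta', 'title');

// Sample markup is an assumption, chosen only for illustration.
$xml = '<div><p></p><br/><span> </span><hr/><em>text</em><ul><li/></ul></div>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Build the delimiter-wrapped name list: |br|hr|link|meta|title|
$names = '|' . implode('|', $list) . '|';

$query = "//*[not(node()[not(self::text())])
          and not(normalize-space())
          and not(contains('$names', concat('|', name(), '|')))]";

$selected = array();
foreach ($xpath->query($query) as $elem) {
    $selected[] = $elem->nodeName;
    // A single space of text keeps the element from being serialized as self-closed.
    $elem->appendChild($dom->createTextNode(' '));
}

echo implode("\n", $selected), "\n";
echo $dom->saveXML($dom->documentElement), "\n";
```

With this sample, p, span and li are selected, while br and hr are left alone. The ul element is skipped because it has an element child, and em because it has non-empty text.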

Upvotes: 3

MacNimble

Reputation: 1061

I use something like this to accomplish a similar task:

<?php
$xml = <<<XML
<html lang='en'>
  <head>
    <meta charset='utf-8'/>
    <title></title>
    <link rel='stylesheet' href='/assets/index.css'/>
  </head>
  <body>
    <div>
      <header>
        <h1></h1>
      </header>
      <section>
        <article></article>
        <aside></aside>
      </section>
      <footer>
        <small>
          Copyright &#169;
          <span></span>
        </small>
      </footer>
    </div>
    <script src='//code.jquery.com/jquery-latest.min.js'></script>
    <script src='/assets/index.js'></script>
  </body>
</html>
XML;
$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
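// Elements that may stay self-closed (duplicates removed from the list)
$null = array( 'br','hr','meta','link','base','img'
             , 'embed','param','area','col','input' );
// Turn each name into an XPath test: not(self::br), not(self::hr), ...
array_walk($null, function(&$v){$v = "not(self::{$v})";});
// Also require that the element has no non-whitespace text
array_unshift($null, 'not(normalize-space())');
$null = implode(' and ', $null);
$node = $xpath->query("//*[{$null}]");

// Serialize before and after, to compare the collapsed and expanded forms
$collapsed = htmlspecialchars($dom->saveXML($dom->documentElement));
// Give each selected element an empty text node child so it is no longer self-closed
foreach ($node as $n) $n->appendChild($dom->createTextNode(''));
$separated = htmlspecialchars($dom->saveXML($dom->documentElement));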

echo '<pre>', $collapsed, '<hr/>', $separated, '</pre>';
?>

Upvotes: 1

ceving

Reputation: 23824

The XPath engine does not have access to PHP variables. You either have to write the list out as a valid XPath expression, or you have to filter the DOM nodes in PHP. The PHP manual explains how to implement filters: http://www.php.net/manual/en/book.filter.php

Upvotes: 1

Explosion Pills

Reputation: 191749

You didn't give us any XML to work with, which is not very nice, but here you go:

$xml = <<<XML
<div>
   <a>
   </a>
   <p>some text</p>
   <p></p>
   <span>no text
      <hr/>
      <ul></ul>
   </span>
   <br/>
</div>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$list = array('br', 'hr', 'link', 'meta', 'title');
$expr = array();
foreach ($list as $l) {
   $expr[] = "not(self::$l)";
}
$expr = implode(' and ', $expr);

foreach ($xpath->query("//*[$expr and not(normalize-space())]") as $elem) {
   echo "$elem->nodeName\n";
}

This outputs

a
p
ul

As expected. Now you have the nodes -- it's up to you to add the space. IMO it would be easier to just use not(normalize-space()) and then see if the nodeName is not in your list, but you asked for an XPath expression, so that's what you got.

Note that normalize-space() is used because an element containing only whitespace may still be automatically closed. If that's not an issue, you can test not(node()) instead, which matches only elements that have no child nodes at all.
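
The PHP-side alternative mentioned above can be sketched as follows: XPath only finds the elements without non-whitespace text, and in_array() decides which of them to skip. This is a rough sketch that reuses the sample XML from this answer. The $names array and the appended single space are assumptions added for illustration.

```php
<?php
// Sketch: filter by tag name in PHP instead of inside the XPath expression.
$list = array('br', 'hr', 'link', 'meta', 'title');

$xml = <<<XML
<div>
   <a>
   </a>
   <p>some text</p>
   <p></p>
   <span>no text
      <hr/>
      <ul></ul>
   </span>
   <br/>
</div>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$names = array();
foreach ($xpath->query('//*[not(normalize-space())]') as $elem) {
    if (in_array($elem->nodeName, $list, true)) {
        continue; // names in $list may stay self-closed
    }
    $names[] = $elem->nodeName;
    // Add the space the question asks for.
    $elem->appendChild($dom->createTextNode(' '));
}

echo implode("\n", $names), "\n";
```

This should print the same three names as the XPath-only version (a, p, ul), and the list can be changed without rebuilding the expression.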

Upvotes: 3

Taha Paksu

Reputation: 15616

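$doc = new DOMDocument();
$doc->loadHTMLFile($file); // $file: path to the HTML document
$xpath = new DOMXPath($doc);

$list = array('br', 'hr', 'link', 'meta', 'title');
$empty_items = array();
// A DOMNodeList is read-only, so the wanted elements are collected in an
// array instead of calling unset() on the query result.
// not(normalize-space()) is used rather than not(text()), which would also
// match non-empty elements that merely have no direct text child.
foreach ($xpath->query("//*[not(normalize-space())]") as $element) {
    if (!in_array($element->nodeName, $list)) {
        $empty_items[] = $element;
    }
}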

Note: I didn't test it. It may have typos or wrong object properties.

Upvotes: 1
