Reputation: 68492
$list = array('br', 'hr', 'link', 'meta', 'title');
Using DOMXPath, how can I select nodes that are empty and whose tagName is not in $list?
(I want to add a space to their textContent so that they are not automatically self-closed.)
Upvotes: 2
Views: 2051
Reputation: 243449
Here is a single, one-liner XPath expression that selects the wanted nodes:
//*[not(node()[not(self::text())])
and not(normalize-space())
and contains('|br|hr|link|meta|title|', concat('|', name(), '|'))
]
This selects any element in the XML document that has only text-node children (if any at all), whose normalized string value (all leading and trailing white-space characters deleted and each run of intermediate white-space characters replaced by a single space) is the empty string, and whose name is one of br, hr, link, meta or title. To select the elements whose name is not in the list, as the question asks, wrap the contains() test in not(...).
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"//*[not(node()[not(self::text())])
and not(normalize-space())
and contains('|br|hr|link|meta|title|', concat('|', name(), '|'))
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<html lang='en'>
<head>
<meta charset='utf-8'/>
<title></title>
<link rel='stylesheet' href='/assets/index.css'/>
</head>
<body>
<div>
<header>
<h1></h1>
</header>
<section>
<article></article>
<aside></aside>
</section>
<br />
<footer>
<small>
Copyright ©
<span></span>
</small>
</footer>
</div>
<script src='//code.jquery.com/jquery-latest.min.js'></script>
<script src='/assets/index.js'></script>
</body>
</html>
the XPath expression is evaluated and the (correctly) selected nodes are copied to the output:
<meta charset="utf-8"/>
<title/>
<link rel="stylesheet" href="/assets/index.css"/>
<br/>
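The expression above keeps the elements whose name is in the list. For the opposite selection, here is a minimal sketch of running a negated form through PHP's DOMXPath. The sample document is a small made-up one (not the document used above), and wrapping contains() in not() is an adaptation rather than part of the verified output shown above.

```php
<?php
// Hypothetical sample document, deliberately smaller than the one above.
$xml = '<html><head><meta charset="utf-8"/><title></title></head>'
     . '<body><h1></h1><p>text</p><br/><span> </span></body></html>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Same three conditions as above, but the name test is negated so that
// br, hr, link, meta and title are skipped instead of selected.
$expr = "//*[not(node()[not(self::text())])"
      . " and not(normalize-space())"
      . " and not(contains('|br|hr|link|meta|title|', concat('|', name(), '|')))]";

$names = array();
foreach ($xpath->query($expr) as $elem) {
    $names[] = $elem->nodeName;
}

echo implode(', ', $names), "\n"; // prints "h1, span"
```

The span is matched even though it holds a space, because normalize-space() reduces whitespace-only content to the empty string.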
Upvotes: 3
Reputation: 1061
I use something like this to accomplish a similar task:
<?php
$xml = <<<XML
<html lang='en'>
<head>
<meta charset='utf-8'/>
<title></title>
<link rel='stylesheet' href='/assets/index.css'/>
</head>
<body>
<div>
<header>
<h1></h1>
</header>
<section>
<article></article>
<aside></aside>
</section>
<footer>
<small>
Copyright ©
<span></span>
</small>
</footer>
</div>
<script src='//code.jquery.com/jquery-latest.min.js'></script>
<script src='/assets/index.js'></script>
</body>
</html>
XML;
$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$null = array( 'br','hr','meta','link','base','img'
, 'embed','param','area','col','input' );
array_walk($null, function(&$v){$v = "not(self::{$v})";});
array_unshift($null, 'not(normalize-space())');
$null = implode(' and ', $null);
$node = $xpath->query("//*[{$null}]");
$collapsed = htmlspecialchars($dom->saveXML($dom->documentElement));
foreach ($node as $n) $n->appendChild($dom->createTextNode(''));
$separated = htmlspecialchars($dom->saveXML($dom->documentElement));
echo '<pre>', $collapsed, '<hr/>', $separated, '</pre>';
?>
Upvotes: 1
Reputation: 23824
The XPath engine does not have access to PHP variables. You either have to write the list out as part of a valid XPath expression or filter the DOM nodes in PHP. The PHP manual explains how to implement filters: http://www.php.net/manual/en/book.filter.php
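The second option, filtering in PHP, might look roughly like this. It is only a sketch: the sample XML is made up, and a plain in_array() check stands in for whatever filtering you prefer.

```php
<?php
// Hypothetical sample input.
$xml = '<div><p></p><hr/><ul></ul><em>x</em><br/></div>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$list = array('br', 'hr', 'link', 'meta', 'title');

// XPath only finds the elements without visible text;
// the PHP side then drops every element whose name is in $list.
$wanted = array();
foreach ($xpath->query('//*[not(normalize-space())]') as $elem) {
    if (!in_array($elem->nodeName, $list, true)) {
        $wanted[] = $elem->nodeName;
    }
}

echo implode(', ', $wanted), "\n"; // prints "p, ul"
```

This keeps the XPath expression fixed, so nothing from $list has to be spliced into the query string.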
Upvotes: 1
Reputation: 191749
You didn't give us any XML to work with, which is not very nice, but here you go:
$xml = <<<XML
<div>
<a>
</a>
<p>some text</p>
<p></p>
<span>no text
<hr/>
<ul></ul>
</span>
<br/>
</div>
XML;
$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$list = array('br', 'hr', 'link', 'meta', 'title');
$expr = array();
foreach ($list as $l) {
$expr[] = "not(self::$l)";
}
$expr = implode(' and ', $expr);
foreach ($xpath->query("//*[$expr and not(normalize-space())]") as $elem) {
echo "$elem->nodeName\n";
}
This outputs
a
p
ul
As expected. Now you have the nodes -- it's up to you to add the space. IMO it would be easier to just use not(normalize-space()) and then see if the nodeName is not in your list, but you asked for an XPath expression, so that's what you got.
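Adding the space could look something like the following sketch. It assumes a shortened, made-up input, builds the predicate from $list with array_map() rather than a foreach loop, and appends a single-space text node to each match.

```php
<?php
// Hypothetical, shortened input.
$xml = '<div><p>some text</p><p></p><ul></ul><br/></div>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$list = array('br', 'hr', 'link', 'meta', 'title');

// Turn each name into "not(self::name)" and join the parts with "and".
$expr = implode(' and ', array_map(function ($name) {
    return "not(self::$name)";
}, $list));

foreach ($xpath->query("//*[$expr and not(normalize-space())]") as $elem) {
    // Give the element a child so it is no longer serialized as <tag/>.
    $elem->appendChild($dom->createTextNode(' '));
}

$result = $dom->saveXML($dom->documentElement);
echo $result, "\n";
```

Elements named in $list, such as br here, are left untouched and still serialize in their short form.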
Note that normalize-space() is used because an element that contains only whitespace may still be closed automatically. If that's not an issue, you can use not(node()) instead.
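To see the difference between the two tests, here is a small sketch on a made-up fragment; nodeNames() is a hypothetical helper that exists only for this illustration.

```php
<?php
// <a> holds only a newline, <p> holds nothing at all, <b> holds real text.
$xml = "<div><a>\n</a><p></p><b>text</b></div>";

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Hypothetical helper: collect the element names from a query result.
function nodeNames(DOMNodeList $nodes)
{
    $names = array();
    foreach ($nodes as $node) {
        $names[] = $node->nodeName;
    }
    return $names;
}

// Matches elements whose text is empty or whitespace only.
$blank = nodeNames($xpath->query('//*[not(normalize-space())]'));

// Matches only elements with no child nodes of any kind.
$childless = nodeNames($xpath->query('//*[not(node())]'));

echo implode(', ', $blank), "\n";     // prints "a, p"
echo implode(', ', $childless), "\n"; // prints "p"
```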
Upvotes: 3
Reputation: 15616
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXPath($doc);
$list = array('br', 'hr', 'link', 'meta', 'title');
// A DOMNodeList cannot be modified with unset(), so collect the wanted
// elements in a separate array instead.
$empty_items = array();
foreach ($xpath->query("//*[not(text())]") as $element) {
    // Keep only elements whose tag name is NOT in $list.
    if ($element instanceof DOMElement &&
        !in_array($element->nodeName, $list)) {
        $empty_items[] = $element;
    }
}
Note: I didn't test it. It may have typos or wrong object properties.
Upvotes: 1