Reputation: 4801
We have the following code that lists the xpaths where $value
is found.
We have detected for a given URL (see on picture) a non standard tag td1
which in addition doesn't have a closing tag. Probably the site developers have put that there intentionally, as you see in the screen shot below.
This element creates problems identifying the corect XPath for nodes.
A broken Xpath example :
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think by removing this element it helps us to build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove prior loading in DOMXpath? Do you have some other approach?
We would like to remove all the invalid tags which may be other than td1, as h8, diw, etc...
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
@$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}
Upvotes: 1
Views: 499
Reputation: 17487
You could use XPath to find the offending nodes and remove them, while promoting its children into its place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
$parentNode = $invalidNode->parentNode;
while ($invalidNode->childNodes)
{
$firstChild = $invalidNode->firstChild;
$parentNode->insertBefore($firstChild,$invalidNode);
}
$parentNode->removeChild($invalidNode);
}
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')');
Upvotes: 1
Reputation: 3798
Sooo... perhaps str_replace($current, "<td1 va-laign=\"top\">", "")
could do the trick?
Upvotes: 1