Ionut Flavius Pogacian

Reputation: 4801

How to remove invalid element from DOM?

We have the following code that lists the xpaths where $value is found.

For a given URL (see the screenshot below) we detected a non-standard tag, td1, which also has no closing tag. The site's developers probably put it there intentionally.

This element causes problems when identifying the correct XPath for nodes.

An example of a broken XPath:

/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]

(as you can see, td1 is picked up and becomes part of the XPath)

We think that removing this element would let us build the valid XPath we are after.

A valid example is:

/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]

How can we remove it before the document is loaded into DOMXPath? Or is there another approach?

We would like to remove all invalid tags, not just td1 but also others such as h8, diw, etc.

private function extract($url, $value) {

        $dom = new DOMDocument();

        $file = 'content.txt';
        //$current = file_get_contents($url);
        $current = CurlTool::downloadFile($url, $file);
        //file_put_contents($file, $current);

        @$dom->loadHTMLFile($current);

        //use DOMXpath to navigate the html with the DOM
        $dom_xpath = new DOMXpath($dom);

        $elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
        var_dump($elements);
        if ($elements !== false) { // query() returns false for an invalid expression, never null

            foreach ($elements as $element) {
                var_dump($element);
                echo "\n1.[" . $element->nodeName . "]\n";

                $nodes = $element->childNodes;
                foreach ($nodes as $node) {
                    if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
                        echo '2.' . $node->nodeValue . "\n";
                        $xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
                        echo '3.' . $xpath . "\n";
                    }
                }
            }
        }
    }

[Screenshot of the page source showing the td1 tag]

Upvotes: 1

Views: 499

Answers (2)

jimp

Reputation: 17487

You could use XPath to find the offending nodes and remove them, promoting their children into their place in the DOM. The paths will then be correct.

$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
    $parentNode = $invalidNode->parentNode;
    // Move each child up in front of the invalid node;
    // firstChild becomes null once the node is empty.
    while ($invalidNode->firstChild)
    {
        $parentNode->insertBefore($invalidNode->firstChild, $invalidNode);
    }
    // The node is now empty and can be dropped.
    $parentNode->removeChild($invalidNode);
}
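To see the effect on the node path, here is a minimal self-contained sketch of the same unwrapping step. The sample markup is made up for illustration (it is not the asker's page), and it assumes libxml nests the following cells inside the unclosed td1, as the broken XPath in the question suggests.

```php
<?php
// Hypothetical sample markup: <td1> is never closed.
$html = '<html><body><table><tr><td1 va-laign="top"><td>a</td>'
      . '<td><span>needle</span></td></tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$span = $xpath->query('//span')->item(0);
$before = $span->getNodePath();     // path while td1 is still in the tree

foreach ($xpath->query('//td1') as $invalidNode) {
    $parentNode = $invalidNode->parentNode;
    // Promote the children, in order, to the position of the invalid node.
    while ($invalidNode->firstChild) {
        $parentNode->insertBefore($invalidNode->firstChild, $invalidNode);
    }
    $parentNode->removeChild($invalidNode);
}

$after = $span->getNodePath();      // path once td1 has been unwrapped

echo $before, "\n", $after, "\n";
```

The $span object stays valid throughout because the nodes are moved, not recreated, so getNodePath() can simply be called again after the cleanup.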

EDIT:

You could also build a list of offending elements by using a list of valid elements and negating it.

// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
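$validTags = array(); // must be filled in; an empty list produces an invalid XPath expression

// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
    if ($validTagsStr)
    {   $validTagsStr .= ' or ';    }
    $validTagsStr .= 'self::'.$tag;
}
// Select every element whose name is not in the list:
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');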
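As a rough illustration of the negated list, the sketch below uses a deliberately abbreviated set of valid tags (an assumption for this example only; a real list would come from the HTML spec) and made-up markup containing td1 and diw. It collects the names of the elements the query flags and then unwraps them the same way as above.

```php
<?php
// Hypothetical sample markup with two invalid elements: <diw> and an unclosed <td1>.
$html = '<html><body><diw><p>x</p></diw><table><tr><td1><td>y</td>'
      . '</tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// Abbreviated whitelist, for illustration only.
$validTags = array('html', 'head', 'body', 'div', 'p', 'span', 'u',
                   'table', 'tbody', 'tr', 'td', 'th');

// Build "self::html or self::head or ..." and negate it.
$conditions = array();
foreach ($validTags as $tag) {
    $conditions[] = 'self::' . $tag;
}
$query = '//*[not(' . implode(' or ', $conditions) . ')]';

$xpath = new DOMXPath($dom);
$invalidNames = array();
foreach ($xpath->query($query) as $invalidNode) {
    $invalidNames[] = $invalidNode->nodeName;
    $parentNode = $invalidNode->parentNode;
    // Promote the children, then drop the now-empty invalid element.
    while ($invalidNode->firstChild) {
        $parentNode->insertBefore($invalidNode->firstChild, $invalidNode);
    }
    $parentNode->removeChild($invalidNode);
}

// How many invalid elements remain after the cleanup.
$remaining = $xpath->query($query)->length;

echo implode(', ', $invalidNames), "\n";
```

With a whitelist, any valid tag missing from the list gets unwrapped too, so the list needs to cover everything the target pages legitimately use.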

Upvotes: 1

Jan Sommer

Reputation: 3798

Sooo... perhaps str_replace("<td1 va-laign=\"top\">", "", $current) could do the trick, with $current holding the raw HTML string before it is loaded into the DOM?
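If more than one invalid tag has to go, the same string-level idea could be generalised with a regular expression. This is only a sketch: the $invalidTags list and the sample markup are assumptions, and pattern matching on HTML is fragile (for example, it breaks on attribute values that contain ">").

```php
<?php
// Hypothetical sample input; in the question this would be the downloaded page source.
$html = '<html><body><table><tr><td1 va-laign="top"><td>a</td>'
      . '<td><span>needle</span></td></tr></table></body></html>';

// Assumed list of invalid tag names (the question mentions td1, h8 and diw).
$invalidTags = array('td1', 'h8', 'diw');

// Match opening or closing tags with those names, including any attributes.
$alternatives = implode('|', array_map('preg_quote', $invalidTags));
$pattern = '#</?(?:' . $alternatives . ')\b[^>]*>#i';
$cleaned = preg_replace($pattern, '', $html);

// Load the cleaned string (loadHTML, not loadHTMLFile, since it is markup, not a path).
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($cleaned);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$span = $xpath->query('//span')->item(0);
$path = $span->getNodePath();

echo $path, "\n";
```

Only the tags themselves are removed, so the content between them stays in place, which matches the "promote the children" behaviour of the DOM-based approach.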

Upvotes: 1
