PHP DOMNode: how to extract not only text but also HTML tags

Question

I'm trying to make a script that scrapes a website to retrieve the latest news updates. Unfortunately I've run into a small issue that I can't seem to fix with my limited knowledge of DOM.

The page I'm trying to scrape is built as follows:



Author
Content in HTML
Date

I can retrieve the fields I need just fine, except for the content. With $td->nodeValue I get the content as plain text, whereas I want it as HTML (it contains 'a' tags, 'blockquote' tags, etc.).

Here's the code I have:

try {
    $html = @file_get_contents("test.php");
    checkIfFileExists($html); // my own helper, not shown here

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $trNodes = $dom->getElementsByTagName("tr");
    foreach ($trNodes as $tr) {
        // Only the news rows carry one of these two classes
        $class = $tr->getAttribute("class");
        if ($class == "color1" || $class == "color2") {
            $tdNodes = $tr->childNodes;
            foreach ($tdNodes as $td) {
                // nodeValue returns the text only; the tags are lost
                echo $td->nodeValue . "\n\n";
            }
            // Blank lines between rows
            echo "\n\n\n\n\n\n";
        }
    }
} catch (Exception $e) {
    echo $e->getMessage();
}

I would prefer not to have to resort to any third party library, but obviously any answer is most appreciated, library or not.

Thanks in advance.

Frederic Bazin · Accepted Answer

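Replace

echo $td->nodeValue . "\n\n";

with

echo $dom->saveXML($td) . "\n\n";

nodeValue returns only the text content of the node, whereas saveXML() called with a node argument serializes that node together with its tags.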
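To make the difference concrete, here is a minimal sketch that compares the three ways of reading a cell. The sample markup, the class name on the row, and the helper innerHtml() are assumptions made for illustration; they are not taken from the real page. It relies on DOMDocument::saveHTML() accepting a node argument, which requires PHP 5.3.6 or later, and the helper's type declarations require PHP 7.

```php
<?php
// Hypothetical markup standing in for one row of the scraped page.
$html = '<table><tr class="color1">'
      . '<td>Alice</td>'
      . '<td>See <a href="http://example.com/">this link</a> for details.</td>'
      . '<td>2011-01-01</td>'
      . '</tr></table>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Second cell of the row: the "content" field.
$td = $dom->getElementsByTagName('td')->item(1);

// Hypothetical helper: serialize only the children of a node,
// i.e. the markup inside the cell without the surrounding <td> tags.
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$text  = $td->nodeValue;           // text only, tags dropped
$outer = trim($dom->saveXML($td)); // the cell including its own <td> tags
$inner = trim(innerHtml($td));     // the cell's content without <td>

echo $text, "\n";
echo $outer, "\n";
echo $inner, "\n";
```

saveXML() emits XML-style markup, so empty elements come out as `<br/>`; if the output is meant to be pasted back into an HTML page, `$dom->saveHTML($td)` may be the better fit. The innerHtml() variant is useful when the wrapping `<td>` itself is not wanted.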
