Steven

Reputation: 75

PHP DOMNode: how to extract the HTML tags as well as the text

I'm trying to make a script that scrapes a website to retrieve the latest news updates. Unfortunately I've run into a small issue that I can't seem to fix with my limited knowledge of DOM.

The page I'm trying to scrape is structured as follows:

<table>
<tr class="color1">
<td>Author</td>
<td>Content <a href="#">in HTML</a></td>
<td>Date</td>
</tr>
</table>

I can retrieve the fields I need just fine, except for the content. $td->nodeValue gives me the content as plain text, whereas I want it as HTML (it contains 'a' tags, 'blockquote' tags, etc.).

Here's the code I have:

try {
    $html = @file_get_contents("test.php");
    checkIfFileExists($html); // user-defined helper, not shown here

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about malformed markup

    $trNodes = $dom->getElementsByTagName("tr");
    foreach ($trNodes as $tr) {
        $class = $tr->getAttribute("class");
        if ($class == "color1" || $class == "color2") {
            $tdNodes = $tr->childNodes;
            foreach ($tdNodes as $td) {
                // nodeValue returns text only; the HTML tags are lost
                echo $td->nodeValue . "<br />\n";
            }
            echo "<br /><br /><br /><br /><br />\n";
        }
    }
} catch (Exception $e) {
    echo $e->getMessage();
}

I would prefer not to have to resort to any third party library, but obviously any answer is most appreciated, library or not.

Thanks in advance.

Upvotes: 6

Views: 870

Answers (1)

Frederic Bazin

Reputation: 1529

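Replace

echo $td->nodeValue . "<br />\n";

with

echo $dom->saveXML($td) . "<br />\n";

saveXML() serializes the node it is given together with its child elements, so the 'a' and 'blockquote' tags are kept instead of being flattened to text.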
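
Note that serializing $td itself also emits the surrounding <td> and </td> tags. If only the markup inside the cell is wanted, one option is to serialize each child node and concatenate the results. The following is a minimal sketch under some assumptions: innerHTML() is a hypothetical helper name (it is not part of the DOM extension), the sample markup is the table from the question, and it relies on DOMDocument::saveHTML() accepting a node argument, which requires PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper (not a built-in): returns the markup inside a node,
// without the node's own opening and closing tags.
function innerHTML(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Serialize each child (text nodes and elements alike) and append it.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample markup taken from the question.
$dom = new DOMDocument();
$dom->loadHTML(
    '<table><tr class="color1">'
    . '<td>Author</td>'
    . '<td>Content <a href="#">in HTML</a></td>'
    . '<td>Date</td>'
    . '</tr></table>'
);

$cells = $dom->getElementsByTagName('td');
$content = innerHTML($cells->item(1));

echo $content, "\n"; // prints: Content <a href="#">in HTML</a>
```

In the original loop this would mean calling innerHTML($td) in place of $td->nodeValue. Because $tr->childNodes can also contain whitespace text nodes between the cells, it may be worth skipping any child that is not a DOMElement before calling it.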

Upvotes: 4
