Tony

Reputation: 3068

Trying to use DOMDocument::loadHTMLFile with a generated url

I'm calling the DOMDocument::loadHTMLFile method with a URL I built.

This is the code I used to build the url:

$url = "http://en.wikipedia.org".$path;

The $path is obtained from an href attribute of another file. When I echo it, it prints /wiki/Pop_music

If I hardcode the URL to http://en.wikipedia.org/wiki/Pop_music the page loads fine, but if I use my generated path I get errors.

This is the code I'm currently working with:

    foreach ($paths as $path) 
    {
        echo $path;                                         // will cause error 
        //echo $path = '/wiki/Pop_music';                       // will work
        $url = "http://en.wikipedia.org"."$path";
        $doc = getHTML($url, 1);

        if($doc !== false)
        {
            $xpath = new DOMXPath($doc);
            $xpathCode = "//h1[@id='firstHeading']";
            $nodes = $xpath->query($xpathCode);
            echo $nodes->item(0)->nodeValue."<br />";
        }
    }

The getHTML function is:

function getHTML($url, $domainID)
{
    $conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);

    // Load HTML
    $doc = new DOMDocument();
    $isSuccessful = $doc->loadHTMLFile($url);

    // Update the time to show that the domain was crawled.
    $sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
    $conArtistsCrawler->query($sql);
    $conArtistsCrawler->close();

    // Delay 1 second after the request to avoid getting BANNED
    sleep(1);

    // Check to see if URL is valid
    if($isSuccessful === false)
    {
        //URL invalid!
        echo "\"".$url."\" is invalid<br>";
        return false;
    }

    return $doc;
}

The code outputs:

With hardcoded path:

Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 /wiki/Pop_music Warning: DOMDocument::loadHTMLFile(): Tag audio invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77

Warning: DOMDocument::loadHTMLFile(): Tag source invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 Pop music

With path variable:

Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 /wiki/Pop_music

Warning: DOMDocument::loadHTMLFile(http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E): failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77

Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load external entity "http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E" in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 "http://en.wikipedia.org/wiki/Pop_music " is invalid

Upvotes: 2

Views: 3270

Answers (1)

Tivie

Reputation: 18923

Short answer:

The error you're getting occurs because $doc is not a DOMDocument object but the boolean false. Since you're suppressing DOMDocument warnings, you can't tell why getHTML() is returning false.

So, lose the @ operator, check what DOMDocument is complaining about, and debug from there.

Edit:

"But I am still unsure why, when I pass in the variable, I get a different result than when I hardcode it. When I echo both path values or URL values they look identical."

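They are certainly not identical. Your $path has a <br /> tag after Pop_music, which makes the URL invalid. It shows up URL-encoded as %3Cbr%20/%3E in the warning. When you echo the value in a browser, the tag renders as a line break, which is why the two values look the same.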
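One way to confirm this is to dump the raw value instead of echoing it into a browser, then strip the markup before building the URL. This is a minimal sketch, assuming the stray markup is a trailing <br /> as the warning suggests; the sample value and the helper name cleanPath are made up for illustration:

```php
<?php
// Hypothetical value, mimicking what the warning suggests $path contains.
$path = '/wiki/Pop_music<br />';

// var_dump() shows the exact string and its length. In a browser, echo would
// render the tag as a line break and hide it.
var_dump($path);

// Hypothetical helper: drop any HTML tags and surrounding whitespace.
function cleanPath(string $rawPath): string
{
    return trim(strip_tags($rawPath));
}

$url = 'http://en.wikipedia.org' . cleanPath($path);
echo $url, PHP_EOL;
```

Whether strip_tags() is the right fix depends on where the <br /> comes from; if your own code appends it while collecting the hrefs, removing it at the source is cleaner.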

Long Answer

Running this script:

$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);

if ($success) {
    $xpath = new DOMXPath($doc);
    $xpathCode = "//h1[@id='firstHeading']";
    $nodes = $xpath->query($xpathCode);
    echo $nodes->item(0)->nodeValue."<br />";
}

produces the following result:

Pop music<br />

So, in order to troubleshoot your script there are a couple of things you should do...

Lose the @ operator

Do not use the @ operator. It swallows every warning thrown at you and makes debugging a lot harder. In all truth, DOMDocument complains a lot, sometimes about things that aren't really errors (such as some HTML5 tags). But it also raises valid warnings, such as malformed HTML or an unreachable URL.

The best way to handle this is to register a custom error handler before calling DOMDocument.

This lets you digest the warnings raised by DOMDocument and differentiate between important and trivial ones.

Example:

set_error_handler(function($errno, $errstr, $errfile, $errline) {
    //Digest error here
});

$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);

restore_error_handler();
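As a rough illustration of what "digesting" could look like, the sketch below collects the warnings into an array and then filters out the ones about unknown tags. The sample markup and the "Tag ... invalid" filter are assumptions for the example, and which warnings actually appear depends on your libxml2 version:

```php
<?php
// Sample markup standing in for a fetched page; <audio> is an HTML5 tag
// that the parser may or may not warn about, depending on the libxml2 version.
$html = '<html><body><h1 id="firstHeading">Pop music</h1><audio></audio></body></html>';

$warnings = [];

// Collect warnings instead of printing them. Returning true tells PHP the
// error was handled, so the built-in handler does not run.
set_error_handler(function (int $errno, string $errstr) use (&$warnings): bool {
    $warnings[] = $errstr;
    return true;
}, E_WARNING);

$doc = new DOMDocument();
$isSuccessful = $doc->loadHTML($html);

restore_error_handler();

// Hypothetical filter: treat "Tag ... invalid" messages as trivial.
$important = array_filter($warnings, function (string $message): bool {
    return preg_match('/Tag \w+ invalid/', $message) !== 1;
});

foreach ($important as $message) {
    echo "Worth a look: $message", PHP_EOL;
}
```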

Note: You can also use libxml_use_internal_errors(true);
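A minimal sketch of the libxml approach, using an inline HTML string in place of a fetched page (the markup is made up for the example). Errors are buffered by libxml rather than raised as PHP warnings, and you can inspect them afterwards:

```php
<?php
// Sample markup standing in for a fetched page.
$html = '<html><body><h1 id="firstHeading">Pop music</h1><audio></audio></body></html>';

// Buffer libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$isSuccessful = $doc->loadHTML($html);

// Each entry is a LibXMLError object with level, code, message, and line.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}

// Empty the buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($isSuccessful) {
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//h1[@id='firstHeading']");
    $heading = $nodes->item(0)->nodeValue;
    echo $heading, PHP_EOL;
}
```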

Your getHTML function returns an inconsistent type

Your getHTML function can return either a DOMDocument object or a boolean. While that isn't a bad thing per se (and internally, PHP does that with a lot of functions), it means you can't assume $doc is an object, because it can be the boolean false. So you have to test the returned value before passing it as an argument to DOMXPath. In fact, that's the error you're getting:

You're passing a boolean to DOMXPath instead of a DOMDocument object.

Either throw an exception (or error) in the function, or test the returned value before passing it to DOMXPath.

Example:

$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
    $xpath = new DOMXPath($doc);
}
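The other option, throwing from inside the function, could look roughly like this. The name getHTMLOrFail is hypothetical, and the database update and sleep() from your getHTML() are left out to keep the sketch short:

```php
<?php
// Hypothetical variant of getHTML() that throws instead of returning false.
function getHTMLOrFail(string $url): DOMDocument
{
    $doc = new DOMDocument();

    // Keep parser noise out of the output; see libxml_use_internal_errors().
    $previous = libxml_use_internal_errors(true);
    $isSuccessful = $doc->loadHTMLFile($url);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($isSuccessful === false) {
        throw new RuntimeException("Could not load \"$url\"");
    }

    return $doc;
}

try {
    // The return type guarantees a DOMDocument here, so no instanceof check is needed.
    $doc = getHTMLOrFail('http://en.wikipedia.org/wiki/Pop_music');
    $xpath = new DOMXPath($doc);
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

With this shape the caller cannot accidentally pass false to DOMXPath, because a failed load never reaches that line.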

Upvotes: 3
