Reputation: 3068
Im calling the DOMDocument::loadHTMLFile method using a url I built.
This is the code I used to build the url:
$url = "http://en.wikipedia.org".$path
The $path
is obtained from an href attribute of another file. when I echo it returns /wiki/Pop_music
If I hardcode the url to http://en.wikipedia.org/wiki/Pop_music
the page returns fine, but if I try to use my generated path I am getting errors.
This is the code I'm currently working with:
foreach ($paths as $path)
{
echo $path; // will cause error
//echo $path = '/wiki/Pop_music'; // will work
$url = "http://en.wikipedia.org"."$path";
$doc = getHTML($url, 1);
if($doc !== false)
{
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
}
The getHTML function is:
function getHTML($url, $domainID)
{
$conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);
// Load HTML
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
// Update the time to show that the domain was crawled.
$sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
$conArtistsCrawler->query($sql);
$conArtistsCrawler->close();
// Delay 1 second after the request to avoid getting BANNED
sleep(1);
// Check to see if URL is valid
if($isSuccessful === false)
{
//URL invalid!
echo "\"".$url."\" is invalid<br>";
return false;
}
return $doc;
}
The code outputs:
With hardcoded path:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 /wiki/Pop_music Warning: DOMDocument::loadHTMLFile(): Tag audio invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Warning: DOMDocument::loadHTMLFile(): Tag source invalid in http://en.wikipedia.org/wiki/Pop_music, line: 225 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 Pop music
With path variable:
Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in http://en.wikipedia.org/wiki/Dido%20(singer), line: 60 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 /wiki/Pop_music
Warning: DOMDocument::loadHTMLFile(http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E): failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load external entity "http://en.wikipedia.org/wiki/Pop_music%3Cbr%20/%3E" in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77 "http://en.wikipedia.org/wiki/Pop_music " is invalid
Upvotes: 2
Views: 3270
Reputation: 18923
Well, the error you're getting is due to the fact that $doc
is not a DOMDocument object but it's the boolean false. Since you're suppressing DOMDocument warnings, you can't know why getHTML()
is returning false.
So, lose the @ operator, check what DOMDocument is complaining about and debug from there.
Edit:
but I am still unsure why when I pass in the variable I get a different result then when I hardcode it. When I echo both path values or url values they look identical
<br/>
tag after Pop_Music which makes the url invalid.Running this script:
$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);
if ($success) {
$xpath = new DOMXPath($doc);
$xpathCode = "//h1[@id='firstHeading']";
$nodes = $xpath->query($xpathCode);
echo $nodes->item(0)->nodeValue."<br />";
}
produces the following result:
Pop music<br />
So, in order to troubleshoot your script there are a couple of things you should do...
Do not use @
operator. This will eat any warning thrown at you, and makes debugging a lot harder. In all truth, DOMDocument complains a lot, sometimes about errors that aren't really errors (such as some HTML5 tags). But it will also throw valid warnings, such as malformed HTML or unreachable URL.
Best way to handle this is using a custom error handler and loading it before DOMDocument.
This will enable you to digest the warnings given by DOMDocument and differentiate between important and trivial ones.
Example:
set_error_handler(function($errno, $errstr, $errfile, $errline) {
//Digest error here
});
$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);
restore_error_handler();
Note: You can also use libxml_use_internal_errors(true);
Your getHTML function can either return a DOMDocument object or a Boolean. While that isn't a bad thing per se (and internally, PHP does that with a lot of functions), that means you can't assume $doc
is an object because it can be the boolean false. So you have to test the returned value before passing it as an argument to XDOMPath. In fact, that's the error you're getting:
You're passing a boolean to XDOMPath instead of a DOMDocument object to XDOMPath
Either throw an an exception (or error) in the function or test the returned value before passing to XDOMPath.
example:
$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
$xpath = new DOMXPath($doc);
}
Upvotes: 3