Déjà vu

Reputation: 783

PHP GetElementsByTagName Error

I wrote a fairly long script that has to retrieve a lot of tags from a website, for example title, h1, h2, h3, a, p and so on. I first did this with preg_match, but soon realised that this is the wrong way of doing it. So I started using this:

function getTextBetweenTags($string, $tagname){
    $d = new DOMDocument();
    $d->loadHTML($string);
    $return = array();
    foreach($d->getElementsByTagName($tagname) as $item){
        $return[] = $item->textContent;
    }
    return $return;
}

and to retrieve a tag: $title = getTextBetweenTags($contents, 'title');

This worked fine on the test page I was using, which was a Wikipedia page.

But as soon as I tested it on another page, it gave me a lot of errors like these:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Input is not proper UTF-8, indicate encoding ! in Entity

and after this one, a lot of:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity,

I did some research and found out that the parser expects '&amp;' instead of a bare '&', for example, so after every special character it expects a ';'. But the contents come straight from file_get_contents ($file_contents = file_get_contents($url);), so I can't change the markup. I REALLY don't want to go back to preg_match (for obvious reasons), so I'm asking if any of you knows how to fix my problem.

Thanks in advance!

Upvotes: 0

Views: 90

Answers (1)

Amal Murali

Reputation: 76646

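You can work around this problem using libxml_use_internal_errors(), which makes libxml collect parse errors internally instead of printing them as warnings. Also, your function currently re-parses the HTML with loadHTML() on every call. I would load the HTML once, outside the function, and pass the resulting DOMDocument as a parameter.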

$string = file_get_contents($url); // fetch the page once, before parsing

$dom = new DOMDocument();
$errorState = libxml_use_internal_errors(TRUE); // collect parse errors instead of displaying them
$dom->loadHTML($string);
libxml_use_internal_errors($errorState); // restore the previous state

// Returns the text content of every element with the given tag name.
// The raw HTML string is no longer needed here, since $dom is already parsed.
function getTextBetweenTags(DOMDocument $dom, $tagname) {
    $return = array();
    foreach($dom->getElementsByTagName($tagname) as $item){
        $return[] = $item->textContent;
    }
    return $return;
}

Example usage:

$title = getTextBetweenTags($dom, 'title');
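If you want to see what libxml complained about instead of discarding it, the collected errors can be read back with libxml_get_errors() and then emptied with libxml_clear_errors(). The following is only a minimal sketch: the helper name loadHtmlQuietly() and the sample markup are made up for illustration, and which messages you get (if any) depends on your libxml version.

```php
<?php
// Hypothetical helper (not part of the code above): parse HTML without
// printing warnings and return the document plus the collected errors.
function loadHtmlQuietly($html) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // buffer errors internally
    $dom->loadHTML($html);
    $errors = libxml_get_errors();          // array of LibXMLError objects
    libxml_clear_errors();                  // empty the buffer for the next parse
    libxml_use_internal_errors($previous);  // restore the previous setting
    return array($dom, $errors);
}

// Made-up sample markup with an unescaped "&" inside a link.
$html = '<html><head><title>Test page</title></head>'
      . '<body><a href="page.php?a=1&b=2">link</a></body></html>';

list($dom, $errors) = loadHtmlQuietly($html);

// Log whatever the parser reported; the list may be empty.
foreach ($errors as $error) {
    echo 'line ' . $error->line . ': ' . trim($error->message) . "\n";
}

// The document is still usable even if errors were reported.
echo $dom->getElementsByTagName('title')->item(0)->textContent . "\n";
```

This way the warnings stay out of your page output, but you can still log them while debugging a page that parses oddly.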

Upvotes: 1
