Reputation: 783
I wrote a fairly long script that has to retrieve a lot of tags from a website, for example title, h1, h2, h3, a, p and so on. I first did this with preg_match but soon realised that this is the wrong way to do it. So I started using this:
function getTextBetweenTags($string, $tagname) {
    $d = new DOMDocument();
    $d->loadHTML($string);
    $return = array();
    foreach ($d->getElementsByTagName($tagname) as $item) {
        $return[] = $item->textContent;
    }
    return $return;
}
And to retrieve a tag: $title = getTextBetweenTags($contents, 'title');
This worked fine on the test page I was using, which was a Wikipedia page.
But as soon as I tested it on another page, it gave me a lot of errors like these:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Input is not proper UTF-8, indicate encoding ! in Entity
and after this one, a lot of:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity,
I did some research and found out that it's expecting a '&amp;' instead of a bare '&', for example, so after every 'special character' it expects a ';'. But since I retrieve the contents with file_get_contents ($file_contents = file_get_contents($url);), I can't correct the markup at the source. I REALLY don't want to go back to preg_match (for obvious reasons), so I'm asking if maybe one of you knows how to fix my problem.
Thanks in advance!
Upvotes: 0
Views: 90
Reputation: 76646
You can work around this problem using libxml_use_internal_errors(). Currently, your function calls loadHTML() every time it is called. I would just load the HTML once outside the function and pass the loaded document as a parameter.
$string = file_get_contents($url);
$dom = new DOMDocument();
$errorState = libxml_use_internal_errors(TRUE); // collect parse errors internally instead of displaying them
$dom->loadHTML($string);
libxml_clear_errors(); // discard the collected errors
libxml_use_internal_errors($errorState); // reset the state
// The document is already loaded, so the function only needs $dom and the tag name
function getTextBetweenTags(DOMDocument $dom, $tagname) {
    $return = array();
    foreach ($dom->getElementsByTagName($tagname) as $item) {
        $return[] = $item->textContent;
    }
    return $return;
}
Example usage:
$title = getTextBetweenTags($dom, 'title'); // array with the text of every <title> element
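If you would rather see what the parser complained about than throw it away, libxml_get_errors() returns the errors collected while internal error handling is switched on. A rough sketch of that idea follows; the helper name loadHtmlQuietly and the sample markup are made up for illustration and are not part of your code, and which messages you get back depends on the libxml2 version PHP is built against:

```php
<?php
// Hypothetical helper (an assumption, not from the code above): loads HTML
// without emitting warnings and returns the document plus the parser messages.
function loadHtmlQuietly($html) {
    $dom = new DOMDocument();
    $errorState = libxml_use_internal_errors(true); // collect errors instead of displaying them
    $dom->loadHTML($html);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        // Each $error is a LibXMLError object with line and message properties
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();                   // empty the internal error buffer
    libxml_use_internal_errors($errorState); // restore the previous setting
    return array($dom, $messages);
}

// Made-up sample markup with unescaped ampersands
$html = '<html><head><title>Fish & Chips</title></head><body><p>A &B</p></body></html>';
list($dom, $messages) = loadHtmlQuietly($html);

$titles = array();
foreach ($dom->getElementsByTagName('title') as $item) {
    $titles[] = $item->textContent;
}

print_r($titles);   // text of every <title> element
print_r($messages); // whatever the parser reported, possibly nothing
```

This way the warnings never reach the page output, but you can still log $messages somewhere if you want to know how broken the remote markup is.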
Upvotes: 1