Reputation: 3749
I'm trying to parse an HTML page, but the encoding is messing up my results. After some research I found a very popular solution using utf8_encode() and utf8_decode(), but it doesn't change anything. In the following lines, you can check my code and the output.
$str_html = $this->curlHelper->file_get_contents_curl($page);
$str_html = utf8_encode($str_html);
$dom = new DOMDocument();
$dom->resolveExternals = true;
$dom->substituteEntities = false;
@$dom->loadHTML($str_html);
$xpath = new DomXpath($dom);
(...)
$profile = array();
for ($index = 0; $index < $table_lines->length; $index++) {
    $desc = utf8_decode($table_lines->item($index)->firstChild->nodeValue);
}
Testar é bom
Should be
Testar é bom
I also tried the following, without success.
htmlentities():
htmlentities($table_lines->item($index)->lastChild->nodeValue, ENT_NOQUOTES, ini_get('ISO-8859-1'), false);
htmlspecialchars():
htmlspecialchars($table_lines->item($index)->lastChild->nodeValue, ENT_NOQUOTES, 'ISO-8859-1', false);
Change my file's charset as described here.
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
Thanks in advance!
Upvotes: 0
Views: 257
Reputation: 16061
Try using the following without a prior utf8_decode():
mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
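As a rough illustration of what that call does, here is a minimal sketch. It assumes the mbstring extension is loaded; the sample string is made up and written with byte escapes so the script file's own encoding doesn't matter.

```php
<?php
// "Testar é bom" as UTF-8 bytes: "é" is the two-byte sequence C3 A9.
$utf8 = "Testar \xC3\xA9 bom";

// Convert from UTF-8 to ISO-8859-1, where "é" is the single byte E9.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Dump the raw bytes so the difference is visible regardless of terminal encoding.
echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

The converted string is one byte shorter, because the accented character shrinks from two bytes to one. It will only display correctly if the page that prints it is actually served as ISO-8859-1.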
Alternatively, don't use utf8_decode() and try to change your website meta to:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
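To show why that second option can work, here is a small sketch under some assumptions: the HTML below is a hypothetical stand-in for the fetched page, its bytes are UTF-8, and the //td query is only illustrative. DOMDocument hands node values back as UTF-8, so if the output page is declared as UTF-8 too, no decoding step should be needed.

```php
<?php
// Hypothetical sample page; "é" is written as the UTF-8 bytes C3 A9.
$html = '<html><head>'
      . '<meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
      . '</head><body><table><tr><td>' . "Testar \xC3\xA9 bom" . '</td></tr></table></body></html>';

// Collect parser warnings instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$cell = $xpath->query('//td')->item(0);

// nodeValue is returned as UTF-8; print it as-is on a UTF-8 page.
$desc = $cell->nodeValue;
echo $desc, "\n";
```

If the source page is really ISO-8859-1 and carries no charset declaration, converting the raw string to UTF-8 once before loading it, and not decoding afterwards, is the part that matters.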
Upvotes: 3