Reputation: 1228
I'm using file_get_contents() to fetch HTML and scrape some data from a website. The source is not always UTF-8, so I am using the ForceUTF8 class to fix it. It doesn't work properly, though. What am I doing wrong?
/* Load UTF-8 HTML */
require_once('/ForceUTF8/Encoding.php');
use \ForceUTF8\Encoding;

function loadHTMLInUtf8($url) {
    // Fetch the raw bytes; the source may be UTF-8, Latin-1, or a mix of both.
    $utf8_or_latin1_or_mixed_string = file_get_contents($url);
    return Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
}

$html = loadHTMLInUtf8('http://www.example.com/');
$dom = new DOMDocument();
// Prepend a meta tag so that DOMDocument parses the markup as UTF-8.
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);
Is there an alternative way of doing this?
Upvotes: 0
Views: 3766
Reputation: 1291
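file_get_contents returns the raw bytes and carries no encoding information, so content that is not UTF-8 ends up garbled unless you convert it yourself.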
Try something like this:
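<?php
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    if ($content === false) {
        return false; // the read failed, so there is nothing to convert
    }
    // Detect the source encoding (strict mode) among the candidates, then convert to UTF-8.
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>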
If this does not work, could you give an example URL where it fails? (I looked at the source of the ForceUTF8 library; it does not look very efficient, and I suspect this small function does the same job using only native PHP functions.)
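To try the conversion without fetching a URL, here is a minimal sketch that works on a string already in memory. The name to_utf8 is hypothetical and not part of the answer above. It assumes the input is either valid UTF-8 or ISO-8859-1, and it swaps mb_detect_encoding for a plain mb_check_encoding test, so treat it as an illustration rather than a drop-in replacement.

```php
<?php
// Hypothetical helper (not from the original answer): normalise a byte string to UTF-8.
// Assumption: the input is either valid UTF-8 or ISO-8859-1.
function to_utf8(string $bytes): string {
    // Valid UTF-8 is returned untouched, which avoids double encoding.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Anything else is treated as ISO-8859-1 and converted.
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";     // "café" in ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8

var_dump(to_utf8($latin1) === $utf8); // bool(true)
var_dump(to_utf8($utf8) === $utf8);   // bool(true)
```

Checking for valid UTF-8 first keeps the function safe to call on pages that are already encoded correctly.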
Upvotes: 1
Reputation: 142
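You can use the utf8_encode() function. It converts ISO-8859-1 to UTF-8, so for Latin-1 input it should do the same as the function above. Note that it does not detect the encoding: input that is already UTF-8 gets encoded a second time.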
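One caveat worth seeing in bytes: utf8_encode() always treats its input as ISO-8859-1, so a string that is already UTF-8 gets encoded twice. The sketch below uses mb_convert_encoding with a fixed ISO-8859-1 source as a stand-in, because utf8_encode() is deprecated as of PHP 8.2. The helper name latin1_to_utf8 is hypothetical.

```php
<?php
// Hypothetical stand-in for utf8_encode(): always assumes ISO-8859-1 input.
function latin1_to_utf8(string $bytes): string {
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$utf8  = "caf\xC3\xA9";          // "café" already encoded as UTF-8
$twice = latin1_to_utf8($utf8);  // each byte of "é" is re-encoded on its own

echo bin2hex($utf8), "\n";   // 636166c3a9
echo bin2hex($twice), "\n";  // 636166c383c2a9 (displays as "cafÃ©")
```

This is why an unconditional conversion only fits sources that are known to be Latin-1. For mixed sources, check or detect the encoding first.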
Upvotes: 2