Álvaro N. Franz

Reputation: 1228

UTF8 with file_get_contents()

I'm using file_get_contents() to fetch HTML and scrape some data from a website. The source is not always UTF-8, so I am using the ForceUTF8 class to fix it, but the result still does not come out correctly. What am I doing wrong?

/* Load HTML and convert it to UTF-8 */
require_once('/ForceUTF8/Encoding.php');
use \ForceUTF8\Encoding;

function loadHTMLInUtf8($url) {
    // The response may be UTF-8, Latin-1, or a mix of both
    $utf8_or_latin1_or_mixed_string = file_get_contents($url);
    return Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
}

$html = loadHTMLInUtf8('http://www.example.com/');
$dom = new DOMDocument();
// Prepend a charset declaration so DOMDocument parses the markup as UTF-8
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Is there an alternative way of doing this?

Upvotes: 0

Views: 3766

Answers (2)

jabbink

Reputation: 1291

file_get_contents() does not convert anything: it returns the raw bytes, so a page that is not served as UTF-8 comes back in its original encoding.

Try something like this:

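<?php
// Read a file or URL and convert its contents to UTF-8.
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    if ($content === false) {
        return false; // read failed
    }
    // Strict detection: try UTF-8 first, then fall back to ISO-8859-1
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>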

If this does not work, could you give an example URL where it fails? I checked the source of the ForceUTF8 library and it does not look very efficient. I suspect this small function does the same job using only native PHP functions.
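
When the server declares a charset in its Content-Type header, that declaration is usually a better hint than byte-level detection. Below is a minimal sketch of that idea, not a tested drop-in: the helper names charsetFromHeaders and bodyToUtf8 are made up for this example, the mbstring extension is assumed, and the header lines are assumed to be passed in as an array of raw strings (for instance the ones PHP's HTTP wrapper exposes after file_get_contents()).

```php
<?php
// Hypothetical helper: pull the charset out of raw HTTP header lines.
// On redirects there can be several Content-Type lines; the last one wins.
function charsetFromHeaders(array $headers): ?string {
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            $charset = strtoupper($m[1]);
        }
    }
    return $charset;
}

// Hypothetical helper: convert $body to UTF-8 using the declared charset,
// falling back to strict detection when none is declared or it is unknown.
function bodyToUtf8(string $body, array $headers): string {
    $charset = charsetFromHeaders($headers);
    if ($charset !== null) {
        try {
            // PHP 8 throws ValueError for unknown encodings; PHP 7 returns false.
            $converted = @mb_convert_encoding($body, 'UTF-8', $charset);
        } catch (\Throwable $e) {
            $converted = false;
        }
        if ($converted !== false) {
            return $converted;
        }
    }
    // Assumption: treat anything undetected as ISO-8859-1.
    $detected = mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1'], true) ?: 'ISO-8859-1';
    return mb_convert_encoding($body, 'UTF-8', $detected);
}
```

A call would look like bodyToUtf8($content, $headers), with $headers holding the response header lines. Pages that declare their charset only in a meta tag would still need the detection fallback.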

Upvotes: 1

jan

Reputation: 142

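You can use utf8_encode(). It converts ISO-8859-1 to UTF-8, so it does the same as the function above only when the source is Latin-1; a string that is already UTF-8 gets encoded twice. It is also deprecated as of PHP 8.2.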
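
To illustrate the risk of an unconditional ISO-8859-1 to UTF-8 conversion (which is what utf8_encode() performs), here is a minimal sketch. The helper name latin1ToUtf8Once is made up for this example, and it assumes the mbstring extension is available and that any input which is not valid UTF-8 is Latin-1.

```php
<?php
// Hypothetical helper: convert Latin-1 input to UTF-8, but leave valid UTF-8 alone.
function latin1ToUtf8Once(string $s): string {
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8; converting again would double-encode
    }
    // Same direction of conversion as utf8_encode(): ISO-8859-1 -> UTF-8
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$utf8  = "caf\xC3\xA9"; // "café", already encoded as UTF-8
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($twice), "\n"; // 636166c383c2a9: the "é" became two garbled characters
```

The check is only a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be left unconverted.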

Upvotes: 2
