Bora Alp Arat
Bora Alp Arat

Reputation: 2412

PHP Curl UTF-8 Charset

I have an php script which calls another web page and writes all the html of the page and everything goes ok however there is a charset problem. My php file encoding is utf-8 and all other php files work ok (that means there is no problem with server). What is the missing thing in that code and all spanish letters look weird. PS. When I wrote these weird characters original versions into php, they all look accurate.

header("Content-Type: text/html; charset=utf-8");
function file_get_contents_curl($url)
{
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_HEADER,0);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
    $data=curl_exec($ch);
    curl_close($ch);
    return $data;
}
$html=file_get_contents_curl($_GET["u"]);
$doc=new DOMDocument();
@$doc->loadHTML($html);

Upvotes: 34

Views: 140180

Answers (6)

amir rasabeh
amir rasabeh

Reputation: 437

You Can use this header

   header('Content-type: text/html; charset=UTF-8');

and after decoding the string

 $page = utf8_decode(curl_exec($ch));

It worked for me

Upvotes: 16

MAChitgarha
MAChitgarha

Reputation: 4288

First method (internal function)

The best way I have tried before is to use urlencode(). Keep in mind, don't use it for the whole url; instead, use it only for the needed parts. For example, a request that has two 'text-fa' and 'text-en' fields and they contain a Persian and an English text, respectively, you might only need to encode the Persian text, not the English one.

Second Method (using cURL function)

However, there are better ways if the range of characters have to be encoded is more limited. One of these ways is using CURLOPT_ENCODING, by passing it to curl_setopt():

curl_setopt($ch, CURLOPT_ENCODING, "");

Upvotes: 3

Taron
Taron

Reputation: 319

$output = curl_exec($ch);
$result = iconv("Windows-1251", "UTF-8", $output);

Upvotes: 4

michalzuber
michalzuber

Reputation: 5215

I was fetching a windows-1252 encoded file via cURL and the mb_detect_encoding(curl_exec($ch)); returned UTF-8. Tried utf8_encode(curl_exec($ch)); and the characters were correct.

Upvotes: 3

Engin Zeybekoğlu
Engin Zeybekoğlu

Reputation: 93

function page_title($val){
    include(dirname(__FILE__).'/simple_html_dom.php');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$val);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
    curl_setopt($ch, CURLOPT_ENCODING , "gzip");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $return = curl_exec($ch); 
    $encot = false;
    $charset = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    curl_close($ch); 
    $html = str_get_html('"'.$return.'"');

    if(strpos($charset,'charset=') !== false) {
        $c = str_replace("text/html; charset=","",$charset);
        $encot = true;
    }
    else {
        $lookat=$html->find('meta[http-equiv=Content-Type]',0);
        $chrst = $lookat->content;
        preg_match('/charset=(.+)/', $chrst, $found);
        $p = trim($found[1]);
        if(!empty($p) && $p != "")
        {
            $c = $p;
            $encot = true;
        }
    }
    $title = $html->find('title')[0]->innertext;
    if($encot == true && $c != 'utf-8' && $c != 'UTF-8') $title = mb_convert_encoding($title,'UTF-8',$c);

    return $title;
}

Upvotes: 3

julio
julio

Reputation: 406

Simple: When you use curl it encodes the string to utf-8 you just need to decode them..

Description

string utf8_decode ( string $data )

This function decodes data , assumed to be UTF-8 encoded, to ISO-8859-1.

Upvotes: 39

Related Questions