mohamed

Reputation: 541

cURL of a page in Russian returns garbled characters

I'm getting an encoding problem when I use cURL in PHP to fetch this page, which is in Russian: https://web.archive.org/web/20060403041216/http://inostranets.ru:80/

Here is the code I'm using:

$url="https://web.archive.org/web/20060403041216/http://inostranets.ru:80/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);         
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'waybackmachinedownloader');
$html = curl_exec($ch);

As a result, I'm getting characters similar to this: "ÂÍÅ ÊÎÍÊÓÐÅÍÖÈÈ – ÑÊÀÇÎ×ÍÛÉ ÑÈÍÃÀÏÓÐ Òóðîïåðàòîð «ÄÅλ ïðèãëàøàåò Âàñ ïîñåòèòü"

Upvotes: 1

Views: 2933

Answers (2)

mohamed

Reputation: 541

I found the problem.

I just had to convert the output to UTF-8 like this:

$html = mb_convert_encoding($html, "UTF-8", "Windows-1251"); 

instead of the following, where "Windows-1251 (CP1251)" is not an encoding name that mbstring accepts:

$html = mb_convert_encoding($html, "UTF-8", "Windows-1251 (CP1251)"); 
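For anyone who wants to try the conversion without hitting the network, here is a minimal sketch. The helper name cp1251_to_utf8 and the sample bytes are my own assumptions, not part of the original code. The UTF-8 check is only a heuristic to avoid converting twice; it cannot prove that the input really is Windows-1251.

```php
<?php
// Hypothetical helper (not from the original answer): convert a
// Windows-1251 byte string to UTF-8, leaving valid UTF-8 input untouched.
function cp1251_to_utf8(string $bytes): string
{
    // Heuristic: text that already validates as UTF-8 is returned as is,
    // so calling the helper twice does not garble the text.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');
}

// "Привет" as Windows-1251 bytes, standing in for the cURL response body.
$html = "\xCF\xF0\xE8\xE2\xE5\xF2";
echo cp1251_to_utf8($html), PHP_EOL; // prints "Привет"
```

In the real script the same call would be applied to the string returned by curl_exec().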

Upvotes: 0

Pedro Lobito

Reputation: 98961

The page you're fetching is encoded as windows-1251. To tell the browser that you're outputting windows-1251, send this header before any output:

header('Content-Type: text/html; charset=windows-1251');

For example:

$url="https://web.archive.org/web/20060403041216/http://inostranets.ru:80/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'waybackmachinedownloader');
$html = curl_exec($ch);

header('Content-Type: text/html; charset=windows-1251');
print $html;
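If you'd rather not hardcode the charset, one option is to read it from the Content-Type value that cURL reports, when the server sends one. This is only a sketch: the helper name charset_from_content_type and the windows-1251 fallback are assumptions of mine, and I haven't checked which header the archive actually returns for this page.

```php
<?php
// Hypothetical helper: extract the charset from a Content-Type value such
// as the one returned by curl_getinfo($ch, CURLINFO_CONTENT_TYPE).
// Falls back to $fallback when no charset parameter is present.
function charset_from_content_type(?string $contentType, string $fallback = 'windows-1251'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return $fallback;
}

echo charset_from_content_type('text/html; charset=windows-1251'), PHP_EOL; // windows-1251
echo charset_from_content_type('text/html'), PHP_EOL;                       // windows-1251 (fallback)
```

The result can then be used to build the header() line instead of the literal windows-1251.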

Update:

To save the $html to a file use:

file_put_contents("curl_russian.html", $html);

Note:

When you open the HTML file, make sure you set your browser's Text Encoding to Cyrillic (Windows).
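An alternative that avoids changing the browser setting is to convert the markup to UTF-8 before saving and to update the charset declared in the page so the two agree. A rough sketch, assuming the page declares windows-1251 in a meta tag; the helper name, the sample markup and the output filename curl_russian_utf8.html are made up for illustration:

```php
<?php
// Hypothetical helper: convert Windows-1251 HTML to UTF-8 and rewrite the
// declared charset so the declaration matches the new encoding.
function html_cp1251_to_utf8(string $html): string
{
    $utf8 = mb_convert_encoding($html, 'UTF-8', 'Windows-1251');
    // Handles both charset=windows-1251 and charset="windows-1251".
    $fixed = preg_replace('/charset\s*=\s*(["\']?)windows-1251/i', 'charset=$1utf-8', $utf8);
    return $fixed ?? $utf8; // preg_replace returns null on error
}

// Sample markup standing in for the cURL response; the body is "Привет"
// encoded as Windows-1251 bytes.
$html = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=windows-1251"></head>'
      . "<body>\xCF\xF0\xE8\xE2\xE5\xF2</body></html>";

file_put_contents('curl_russian_utf8.html', html_cp1251_to_utf8($html));
```

The saved file should then open correctly with the browser's default encoding detection.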


Upvotes: 2
