Reputation: 541
I'm getting an encoding problem when fetching this Russian-language page with cURL in PHP: https://web.archive.org/web/20060403041216/http://inostranets.ru:80/
Here is the code I'm using:
$url="https://web.archive.org/web/20060403041216/http://inostranets.ru:80/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'waybackmachinedownloader');
$html = curl_exec($ch);
As a result, I'm getting characters like this: "ÂÍÅ ÊÎÍÊÓÐÅÍÖÈÈ – ÑÊÀÇÎ×ÍÛÉ ÑÈÍÃÀÏÓÐ Òóðîïåðàòîð «ÄÅλ ïðèãëàøàåò Âàñ ïîñåòèòü"
Upvotes: 1
Views: 2933
Reputation: 541
I found the problem. The source encoding name passed to mb_convert_encoding has to be exactly "Windows-1251", so I just have to convert the output like this:
$html = mb_convert_encoding($html, "UTF-8", "Windows-1251");
instead of:
$html = mb_convert_encoding($html, "UTF-8", "Windows-1251 (CP1251)");
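The conversion can be checked without any network call. The following is a minimal sketch, assuming the mbstring extension is loaded; the helper name cp1251ToUtf8 and the sample bytes are illustrative and not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): convert Windows-1251 bytes to UTF-8.
function cp1251ToUtf8(string $bytes): string
{
    // The third argument must be an encoding name that mbstring recognizes,
    // such as "Windows-1251"; a label like "Windows-1251 (CP1251)" is not one.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');
}

// These six bytes are the word "Привет" encoded as Windows-1251.
$sample = "\xCF\xF0\xE8\xE2\xE5\xF2";
echo cp1251ToUtf8($sample), "\n"; // prints "Привет" on a UTF-8 terminal
```

In the question's script, the same call would be applied to the $html string returned by curl_exec.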
Upvotes: 0
Reputation: 98961
The page you're trying to parse is windows-1251 encoded.
To tell the browser you're outputting windows-1251, you can use:
header('Content-Type: text/html; charset=windows-1251');
i.e.:
$url="https://web.archive.org/web/20060403041216/http://inostranets.ru:80/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'waybackmachinedownloader');
$html = curl_exec($ch);
header('Content-Type: text/html; charset=windows-1251');
print $html;
Update:
To save $html to a file, use:
file_put_contents("curl_russian.html", $html);
Note:
When you open the HTML file, make sure you set Text Encoding to Cyrillic Windows in your browser.
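If choosing the encoding by hand in the browser is inconvenient, one possible alternative is to convert the page to UTF-8 and rewrite its charset declaration before saving. This is only a sketch under assumptions: the function name toUtf8Page is hypothetical, mbstring is assumed to be available, and the page is assumed to declare its charset as windows-1251 in a meta tag.

```php
<?php
// Hypothetical helper (not from the original answer): return a UTF-8 copy of a
// Windows-1251 page with its charset declaration updated to match.
function toUtf8Page(string $html): string
{
    // Convert the raw bytes from Windows-1251 to UTF-8.
    $utf8 = mb_convert_encoding($html, 'UTF-8', 'Windows-1251');

    // Rewrite "charset=windows-1251" (optionally quoted, any letter case) to UTF-8.
    $fixed = preg_replace('/charset\s*=\s*(["\']?)windows-1251/i', 'charset=${1}UTF-8', $utf8);

    // preg_replace returns null on error; fall back to the converted text.
    return $fixed ?? $utf8;
}

// Usage with the $html string from curl_exec (assumed to be set by the earlier code):
// file_put_contents("curl_russian_utf8.html", toUtf8Page($html));
```

The output file name in the usage comment is also just an example. A file saved this way should open correctly in a browser without changing the Text Encoding setting, provided the declaration was found and rewritten.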
Upvotes: 2