mohamed

Reputation: 531

Encoding error for Thai language with cURL in PHP

I'm trying to fetch this page with cURL and save the result as an HTML page. I used this code:

        $url = "https://web.archive.org/web/20160202021236/http://www.mpshopfashion.com";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301 redirection
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0');
        $html = curl_exec($ch);

The resulting HTML page looks correct when I open it in a browser, but when I open it in an editor, I see text like this:

à¤Ã×èͧ»ÃдѺῪÑè¹ à¤Ã×èͧ»ÃдѺῪÑè¹à¡ÒËÅÕ ÊÃéÍÂ¤Í ÊÃéÍ¢éÍÁ×Í µèÒ§ËÙ ¢Ò»ÅÕ¡-¢ÒÂÊè§

Instead of this

เครื่องประดับแฟชั่น เครื่องประดับแฟชั่นเกาหลี สร้อยคอ สร้อยข้อมือ ต่างหู ขายปลีก-ขายส่ง

Upvotes: 0

Views: 1338

Answers (2)

Álvaro González

Reputation: 146430

Web sites typically declare their encoding in HTTP headers. Please note Content-Type in this screenshot from Firefox Developer Tools:

[Screenshot: Firefox Developer Tools]

TIS-620 is apparently a common legacy encoding used in Thailand (although UTF-8 has by now made such legacy encodings obsolete for new content).
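You can also read the same header from PHP instead of from the browser tools. Here is a rough sketch: `curl_getinfo()` with `CURLINFO_CONTENT_TYPE` is a real cURL call, but `charsetFromContentType` is a helper name I made up for illustration, and the regular expression only covers the usual `charset=...` form:

```php
<?php
// Hypothetical helper (not part of cURL): extract the charset parameter
// from a Content-Type value such as "text/html; charset=TIS-620".
function charsetFromContentType($contentType): ?string
{
    // curl_getinfo() may not return a string if the server sent no usable header.
    if (is_string($contentType)
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null; // no charset declared
}

// Usage with the handle from the question, after curl_exec($ch):
// $charset = charsetFromContentType(curl_getinfo($ch, CURLINFO_CONTENT_TYPE));
```

If this returns `null`, the server did not declare a charset in the header and you are left with whatever the document itself says.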

Your editor should have a setting to select the encoding, as well as access to the appropriate fonts and, of course, support for that specific encoding. Here's a screenshot from RJ TextEd:

[Screenshot: RJ TextEd]

As a fallback option (after all, HTTP headers do not exist outside HTTP), HTML provides <meta> tags as an alternative way to identify the encoding:

<meta http-equiv="Content-Type" content="text/html; charset=windows-874"/>

In this case, we can see that it does not even match the HTTP headers, which declare TIS-620.

Once more, it is up to the specific editor you are using (which you have not disclosed) whether it implements logic to check meta tags and identify the encoding. There is simply no universal, one-size-fits-all solution that works automagically in every editor.
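If what you actually want is a saved file that most editors open correctly without any configuration, one option is to convert the downloaded bytes to UTF-8 before writing them out. A minimal sketch, assuming the page really is TIS-620 and that your PHP build's iconv knows that encoding (glibc's does; other implementations may not). `convertPageToUtf8` is a hypothetical helper name, and the regular expression only rewrites the first `charset=...` declaration it finds:

```php
<?php
// Hypothetical helper: convert a page from a legacy charset to UTF-8
// and update its charset declaration to match.
function convertPageToUtf8(string $html, string $sourceCharset): string
{
    // iconv() returns false when the conversion fails.
    $utf8 = iconv($sourceCharset, 'UTF-8', $html);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $sourceCharset to UTF-8");
    }

    // Rewrite the first charset declaration so browsers do not keep
    // decoding the new UTF-8 bytes as a Thai legacy encoding.
    $fixed = preg_replace('/(charset\s*=\s*["\']?)[\w-]+/i', '${1}utf-8', $utf8, 1);

    return $fixed ?? $utf8; // fall back to the converted text if the regex fails
}

// Usage with the result from the question:
// $html = convertPageToUtf8($html, 'TIS-620');
```

Keep in mind that the HTTP header and the <meta> tag disagree for this page, so the source charset passed in here is a guess you should verify against the output.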

Upvotes: 1

Lukáš Klíma

Reputation: 102

It's probably caused by bad encoding settings on the website, or even in the cURL request. How about using a wrapper for cURL, which is really hard to configure correctly?

I can recommend Guzzle for this.

https://github.com/guzzle/guzzle

Upvotes: 0
