user1424739

Reputation: 13735

How to deal with CJK characters correctly for a webpage?

I am not able to see the CJK characters correctly. The file seems to be misidentified as ISO-8859 encoded, whereas I think UTF-8 is the appropriate encoding. Does anybody know how to fix the problem?

$ wget http://yjs.cd120.com/daoshi.html 
$ grep 'selectid="99"' daoshi.html 
Binary file daoshi.html matches
$ file daoshi.html 
daoshi.html: HTML document text, ISO-8859 text, with very long lines, with CRLF line terminators

Upvotes: 0

Views: 294

Answers (2)

Archelangelo

Reputation: 36

First, you have to determine the actual encoding of the file obtained through wget (or curl, for that matter).

Issuing the command:

grep 'Content-Type' daoshi.html

will display:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

where charset=gb2312 means that the HTML file is encoded in Simplified Chinese (GB 2312), not UTF-8 or ISO-8859.

Then, you can use the iconv command to convert the file to a new UTF-8 version:

iconv -f gb2312 -t utf-8 daoshi.html >daoshi-utf8.html

Finally, depending on your needs, you may want to adjust the meta tag contents at the beginning of the file to match the new encoding, using sed for instance:

sed 's/charset=gb2312/charset=utf-8/' daoshi-utf8.html >daoshi-utf8-final.html
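
The three steps above can be rolled into one small shell function. This is only a sketch: the name to_utf8 is made up for illustration, it assumes the page declares its encoding with a lowercase charset=... in a meta tag, and it relies on the -a and -o options of grep, which GNU and BSD grep provide but POSIX does not require.

```shell
# Hypothetical helper (not part of any tool): convert an HTML file from the
# charset declared in its meta tag to UTF-8 and update the declaration.
# Usage: to_utf8 daoshi.html daoshi-utf8.html
to_utf8() {
    in=$1
    out=$2

    # -a forces grep to treat the file as text; without it, the non-UTF-8
    # bytes make grep print "Binary file ... matches" instead of the line.
    charset=$(LC_ALL=C grep -a -o -E 'charset="?[A-Za-z0-9_-]+' "$in" \
        | head -n 1 | sed 's/.*=//; s/"//')

    if [ -z "$charset" ]; then
        echo "to_utf8: no charset declaration found in $in" >&2
        return 1
    fi

    # Step 1: convert the bytes from the declared encoding to UTF-8.
    if ! iconv -f "$charset" -t UTF-8 "$in" > "$out.tmp"; then
        rm -f "$out.tmp"
        return 1
    fi

    # Step 2: rewrite the declaration so it matches the new encoding.
    LC_ALL=C sed "s/\(charset=\"\{0,1\}\)$charset/\1utf-8/" "$out.tmp" > "$out"
    rm -f "$out.tmp"
}
```

If iconv stops with an "illegal input sequence" error, the page may contain characters outside GB 2312 proper; pages declared as gb2312 often use GBK extensions in practice, so passing gb18030 (a superset of both) to iconv by hand is worth trying. The same -a option also explains the "Binary file daoshi.html matches" output in the question: grep -a 'selectid="99"' daoshi.html prints the matching line, although it is only readable after conversion.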

Upvotes: 2

Intervalia

Reputation: 10965

https://www.w3.org/International/questions/qa-changing-encoding

Summary:

Step 1: Save the data as UTF-8

Step 2: Declare the encoding in your page

<meta charset="utf-8"/>

Step 3: Ensure that your server does the right thing, i.e. that the charset it sends in the HTTP Content-Type header matches the new encoding
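
Steps 1 and 3 can be checked from the command line. The two helpers below are a rough sketch with made-up names (is_valid_utf8 and header_charset); they assume an iconv command is installed and that the response headers are fetched separately, for example with curl -sI.

```shell
# Hypothetical helper: succeed only if FILE decodes cleanly as UTF-8.
# Converting UTF-8 to UTF-8 is a no-op that fails on invalid byte sequences.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: read HTTP response headers on stdin and print the
# charset given in the Content-Type header (prints nothing if there is none).
header_charset() {
    LC_ALL=C tr -d '\r' \
        | LC_ALL=C grep -i '^content-type:' \
        | LC_ALL=C sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\([^; ]*\).*/\1/p' \
        | head -n 1
}

# Example usage (the URL is a placeholder, not taken from the question):
#   is_valid_utf8 page.html && echo "page.html is valid UTF-8"
#   curl -sI http://example.com/page.html | header_charset
```

If header_charset prints something other than utf-8, browsers will generally trust the header over the meta tag, so the server configuration needs to change as well.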

Upvotes: 0