Reputation: 13735
I cannot see the CJK characters correctly. The file seems to be misidentified as ISO-8859 encoded, whereas I think UTF-8 would be the appropriate encoding. Does anybody know how to fix this?
$ wget http://yjs.cd120.com/daoshi.html
$ grep 'selectid="99"' daoshi.html
Binary file daoshi.html matches
$ file daoshi.html
daoshi.html: HTML document text, ISO-8859 text, with very long lines, with CRLF line terminators
Upvotes: 0
Views: 294
Reputation: 36
First, you have to determine the actual encoding of the file obtained through wget (or curl, for that matter).
Issuing the command:
grep 'Content-Type' daoshi.html
will display:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
where charset=gb2312 means that the HTML file is encoded in Simplified Chinese (GB 2312).
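As a rough sketch, the declared charset can also be pulled out of the meta tag directly rather than read by eye. The file name sample.html below is a hypothetical stand-in created on the spot so the example does not depend on the network; for the real page you would point the same pipeline at daoshi.html.

```shell
# Create a small stand-in page (hypothetical file name) carrying the same meta tag.
printf '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />\r\n' > sample.html

# -a: treat the file as text even if it holds bytes that are invalid in the current locale
# -i: match "charset" case-insensitively
# -o: print only the matching part, e.g. charset=gb2312
charset=$(grep -a -i -o 'charset=[A-Za-z0-9_-]*' sample.html | head -n 1 | cut -d= -f2)

echo "$charset"
```

If grep reports "Binary file ... matches" instead of printing the line, as it does in the question, the -a option is what asks it to treat the file as text.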
Then, you can use the iconv command to convert the file to a new UTF-8 version:
iconv -f gb2312 -t utf-8 daoshi.html >daoshi-utf8.html
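To see what this conversion does on a small scale, here is a minimal sketch that does not need the downloaded page. It assumes an iconv build with GB2312 support; the file names sample-gb.txt and sample-utf8.txt are made up for the illustration.

```shell
# Write the GB2312 bytes for the two characters "中文" (D6 D0 CE C4) plus a newline.
printf '\326\320\316\304\n' > sample-gb.txt

# Convert from GB2312 to UTF-8, exactly as for the full page.
iconv -f gb2312 -t utf-8 sample-gb.txt > sample-utf8.txt

# Each of the two characters takes 3 bytes in UTF-8, so the file grows from 5 to 7 bytes.
wc -c < sample-utf8.txt
```

Pages that declare gb2312 sometimes contain characters from the larger GBK or GB18030 sets. If iconv stops with an "illegal input sequence" error on the real file, trying -f gb18030 instead is a reasonable next step, since GB18030 is a superset of GB2312.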
Finally, depending on your needs, you may want to adjust the meta tag contents at the beginning of the file to match the new encoding, using sed for instance:
sed 's/charset=gb2312/charset=utf-8/' daoshi-utf8.html >daoshi-utf8-final.html
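The conversion and the meta tag rewrite can also be chained into a single pipeline, which avoids the intermediate file. This is only a sketch: page-gb.html and page-utf8.html are hypothetical names, and the sample page is generated locally so the commands can be tried without downloading anything.

```shell
# Build a tiny GB2312-encoded page: the meta tag plus the characters "中文" (D6 D0 CE C4).
printf '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />\n<p>\326\320\316\304</p>\n' > page-gb.html

# Re-encode to UTF-8, then update the declared charset on the way through.
iconv -f gb2312 -t utf-8 page-gb.html \
  | sed 's/charset=gb2312/charset=utf-8/' > page-utf8.html

# Count the lines that now declare UTF-8.
grep -c 'charset=utf-8' page-utf8.html
```

Note that the sed pattern is case-sensitive, so a page that declares charset=GB2312 in upper case would need the pattern adjusted accordingly.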
Upvotes: 2
Reputation: 10965
https://www.w3.org/International/questions/qa-changing-encoding
Summary:
Step 1: Save the data as UTF-8
Step 2: Declare the encoding in your page
<meta charset="utf-8"/>
Step 3: Ensure that your server does the right thing
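For Step 1, one way to check that a saved file really is UTF-8 is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is a sketch under the assumption that your iconv rejects invalid input with a non-zero status (GNU iconv does); is_utf8 and the two file names are hypothetical helpers for the illustration.

```shell
# One valid UTF-8 file ("中文") and one GB2312 file holding the same two characters.
printf '\344\270\255\346\226\207\n' > good-utf8.txt
printf '\326\320\316\304\n' > not-utf8.txt

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
  iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

if is_utf8 good-utf8.txt; then echo "good-utf8.txt: valid UTF-8"; fi
if ! is_utf8 not-utf8.txt; then echo "not-utf8.txt: not valid UTF-8"; fi
```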
Upvotes: 0