Reputation: 2840
I have a Big5-encoded file that Mac TextEdit can't open. How can I convert the whole file to UTF-8, since UTF-8 is much more universal and common?
I tried using iconv in my terminal, but it fails, and I can't find anything useful about this error through Google either.
$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert
Are there any other ways to convert?
I got the txt file from here, which is a list of Chinese names written in Taiwan Traditional Chinese.
Upvotes: 3
Views: 9212
Reputation: 1
I had to do something similar and what worked for me was this:
Right-click the file and choose Open With... Google Chrome. If that doesn't work, open it with Safari, then from the View menu choose Text Encoding > and pick Big5 (or whatever the file's encoding is).
Next, press Cmd-A to select all the text and Cmd-C to copy it.
Now paste that text into a new blank document in TextEdit.
Finally, choose Save... and make sure to select UTF-8 as the encoding.
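If you'd rather script the same "decode as Big5, save as UTF-8" round trip, here is a minimal Python sketch. The function name `reencode_to_utf8` and its parameters are made up for illustration, and Python's `big5` codec uses its own mapping tables, so it may accept or reject different bytes than a browser or iconv would.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="big5", errors="strict"):
    """Decode src using source_encoding and write the text to dst as UTF-8.

    Returns the number of decoded characters. With errors="strict" an
    undecodable byte raises UnicodeDecodeError; with errors="replace" it
    is turned into U+FFFD instead (a lossy conversion).
    """
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding, errors=errors)
    # Write bytes directly so line endings are preserved exactly as decoded.
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text)
```

Calling `reencode_to_utf8("in.txt", "out.txt")` would be the equivalent of the iconv command in the question. Passing `errors="replace"` gets you an output file even when some bytes can't be decoded, at the cost of losing those characters.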
Upvotes: 0
Reputation: 1282
Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are the candidates listed in the loop below.
Try them in turn:
$ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
    JOHAB Shift_JIS Shift_JISX0213; do \
    if head -n 20 < unique_names_2012.txt | iconv -f "$encoding" -t UTF-8 > /dev/null 2> /dev/null; then \
      echo "$encoding"; \
    fi; \
  done
With GNU libiconv, it prints
BIG5-HKSCS
CP950
GB18030
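For anyone without GNU libiconv at hand, the same "try each candidate" probe can be sketched in Python. The codec names below are Python's spellings of the encodings in the loop above, and `decodable_with` and `first_lines` are hypothetical helper names. Python ships its own conversion tables, so the set of encodings that succeed may not match the iconv output shown here.

```python
from itertools import islice

# Python codec names corresponding to the iconv names tried above.
CANDIDATES = [
    "big5", "big5hkscs", "cp932", "cp936", "cp949", "cp950",
    "gb18030", "gbk", "johab", "shift_jis", "shift_jisx0213",
]


def first_lines(path, n=20):
    """Return the first n lines of a file as raw bytes (like `head -n`)."""
    with open(path, "rb") as f:
        return b"".join(islice(f, n))


def decodable_with(data, encodings=CANDIDATES):
    """Return the encodings that decode data without any error."""
    ok = []
    for enc in encodings:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the byte sequence
        ok.append(enc)
    return ok
```

`decodable_with(first_lines("unique_names_2012.txt"))` would then play the role of the shell loop. As with iconv, a clean decode only means the bytes are valid in that encoding, not that the resulting text is the intended one.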
Is it in GB18030 encoding?
$ iconv -f GB18030 < unique_names_2012.txt
shows hundreds of lines that use characters in the PUA range. While not impossible, it seems unlikely.
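The PUA check doesn't have to be done by eye. As a rough sketch (the helper name `private_use_lines` is made up), you can decode the file and list the lines that contain a character in Unicode general category `Co`, which covers the private-use ranges:

```python
import unicodedata


def private_use_lines(text):
    """Return 1-based numbers of lines containing a private-use character."""
    return [
        lineno
        for lineno, line in enumerate(text.split("\n"), start=1)
        # Category "Co" covers U+E000..U+F8FF and the plane 15/16 PUA ranges.
        if any(unicodedata.category(ch) == "Co" for ch in line)
    ]
```

You would feed it the file's bytes decoded with Python's `gb18030` codec. Whether it flags the same lines as the iconv output is an assumption, since the two implementations need not map every byte sequence identically.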
Is it in CP950 encoding?
$ iconv -f CP950 < unique_names_2012.txt
gives a conversion error at line 2294.
Is it in BIG5-HKSCS encoding?
$ iconv -f BIG5-HKSCS < unique_names_2012.txt
gives a conversion error at line 713.
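To see where a given candidate encoding gives up, a small Python sketch like the one below reports the line, byte offset, and offending bytes of the first failure. `first_decode_error` is a hypothetical helper. Python's `cp950` and `big5hkscs` tables are not guaranteed to match GNU libiconv's, so the reported lines may differ from 2294 and 713.

```python
def first_decode_error(data, encoding):
    """Locate the first byte sequence that cannot be decoded.

    Returns (line_number, byte_offset, offending_bytes), or None if the
    whole input decodes cleanly. Lines are counted by newline bytes
    before the failing offset, starting at 1.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        line_number = data.count(b"\n", 0, err.start) + 1
        return line_number, err.start, data[err.start:err.end]
    return None
```

Printing the offending bytes in hex (for example with `offending_bytes.hex()`) makes it easier to look them up in the conversion tables of the various BIG5 variants.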
So, most probably the file is encoded in a variant of BIG5. There are many such variants, see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. Possibly it's an extension of CP950 or an extension of BIG5-HKSCS (since these are the most popular encodings from the BIG5 family today).
In summary, such conversion errors are caused by unstandardized proliferation of BIG5 variants.
The best thing you can do is to request the original file in UTF-8 encoding; let the originator deal with it.
Upvotes: 7