Jay Wang

Reputation: 2840

How to convert a Big5 encoded txt file to UTF8 encoded txt file?

I have a Big5-encoded file that Mac TextEdit can't open. How can I convert the whole file to UTF-8, which is far more widely supported?

I tried using iconv in my terminal, but it fails with the error below. I couldn't find anything useful about this error on Google either.

$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert

Are there any other ways to convert?

I got the txt file from here, which is a list of Chinese names written in Traditional Chinese as used in Taiwan.

Upvotes: 3

Views: 9212

Answers (2)

Tyler

Reputation: 1

I had to do something similar and what worked for me was this:

Right-click the file and choose Open With > Google Chrome. If that doesn't work, open it with Safari, then choose View > Text Encoding and pick Big5 (or whatever the encoding is).

Next, press Cmd-A to select all the text, then Cmd-C to copy it.

Now paste that text into a new blank document in TextEdit.

Now choose Save... and make sure to select UTF-8 as the encoding.

Upvotes: 0

Bruno Haible

Reputation: 1282

Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are:

  • BIG5
  • BIG5-HKSCS
  • CP932
  • CP936
  • CP949
  • CP950
  • GB18030
  • GBK
  • JOHAB
  • Shift_JIS
  • Shift_JISX0213

Try them in turn:

$ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
                  JOHAB Shift_JIS Shift_JISX0213; do \
  if head -n 20 < unique_names_2012.txt | iconv -f $encoding -t UTF-8 > /dev/null 2> /dev/null; then \
    echo $encoding ; \
  fi; \
done

With GNU libiconv, it prints

BIG5-HKSCS
CP950
GB18030
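For readers without GNU libiconv, the same trial-and-error loop can be sketched in Python. This is only an illustration: the function names are hypothetical, the codec names are Python's spellings of the encodings listed above, and Python's conversion tables are not identical to libiconv's, so the set of encodings that pass may differ from the output shown here.

```python
# Python codec names corresponding to the encodings listed above.
CANDIDATES = [
    "big5", "big5hkscs", "cp932", "cp936", "cp949", "cp950",
    "gb18030", "gbk", "johab", "shift_jis", "shift_jisx0213",
]


def first_lines(path, n=20):
    """Read the first n lines of a file as raw bytes (like `head -n 20`)."""
    with open(path, "rb") as f:
        return b"".join(line for _, line in zip(range(n), f))


def encodings_that_decode(data, candidates=CANDIDATES):
    """Return the candidate codecs that decode `data` without an error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on any invalid sequence
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Example use (the file name is the one from the answer above):
#   print(encodings_that_decode(first_lines("unique_names_2012.txt")))
```

As with the shell loop, passing on the first 20 lines only narrows the field; it does not prove that the whole file is valid in that encoding.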

Is it in GB18030 encoding?

$ iconv -f GB18030 < unique_names_2012.txt

shows hundreds of lines that use characters in the Unicode Private Use Area (PUA). While not impossible, this makes GB18030 seem unlikely.
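If you would rather count those lines than eyeball the output, a rough Python sketch follows. It assumes the file has already been decoded to a string; the function names are hypothetical, and "private use" here means Unicode general category Co.

```python
import unicodedata


def is_private_use(ch):
    """True if the character is in a Unicode Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"


def count_pua_lines(text):
    """Count the lines of `text` that contain at least one private-use character."""
    return sum(
        1
        for line in text.splitlines()
        if any(is_private_use(ch) for ch in line)
    )


# Example use (the file name is the one from the answer above):
#   with open("unique_names_2012.txt", "rb") as f:
#       text = f.read().decode("gb18030")
#   print(count_pua_lines(text))
```

A high count for a list of ordinary personal names suggests the guessed encoding is wrong, since real names rarely need private-use code points.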

Is it in CP950 encoding?

$ iconv -f CP950 < unique_names_2012.txt

gives a conversion error at line 2294.

Is it in BIG5-HKSCS encoding?

$ iconv -f BIG5-HKSCS < unique_names_2012.txt

gives a conversion error at line 713.
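To see exactly which bytes trip a given decoder, something like the following Python sketch can help. The function name is hypothetical, and because Python's codecs are not the same tables as libiconv's, the line numbers it reports may not match the ones above.

```python
def first_decode_error(data, encoding):
    """Return (line_number, byte_offset, offending_bytes) for the first
    sequence that `encoding` cannot decode, or None if everything decodes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Count newline bytes before the failure to get a 1-based line number.
        # This relies on 0x0A never being part of a multibyte sequence,
        # which holds for the Big5 and GB families.
        line = data.count(b"\n", 0, exc.start) + 1
        return line, exc.start, data[exc.start:exc.end]
    return None


# Example use (the file name is the one from the answer above):
#   with open("unique_names_2012.txt", "rb") as f:
#       print(first_decode_error(f.read(), "cp950"))
```

Looking up the reported bytes in the conversion tables of the various BIG5 variants is one way to narrow down which extension the file uses.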

So, most probably the file is encoded in a variant of BIG5. There are many such variants; see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. Possibly it's an extension of CP950 or an extension of BIG5-HKSCS, since these are the most popular encodings from the BIG5 family today.

In summary, such conversion errors are caused by the unstandardized proliferation of BIG5 variants.

The best thing you can do is to request the original file in UTF-8 encoding; let the originator deal with it.

Upvotes: 7
