How to use codecs to convert from one language to a utf-8 encoded document?

Question

I am still fairly new to Python, and I have a document in Japanese that I am trying to convert to a UTF-8 encoded document. I am not sure what I should be getting back when I do this. When I run my current program, it deletes everything and leaves me with a blank UTF-8 encoded document. Here is what I have; any help is greatly appreciated.

EDIT: I'm sorry it was a typo, I fixed the original encoding. It is Shift-jis.

import codecs

codecs.open("rshmn10j.txt", 'r', encoding='shift-jis')

newfile = codecs.open("rshmn10j.txt", 'w', encoding='utf-8')
newfile.write(u'\ufeff')
newfile.close()

monkut · Accepted Answer

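If you're trying to convert a document from encoding "x" to encoding "utf8", you first have to read the document using the encoding it was saved in, then write the decoded text back out as UTF-8. In your code the result of the first codecs.open call is never read, and reopening the same file in 'w' mode truncates it, which is why you end up with a blank file.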

import codecs

original_document_encoding = "shift-jis"  # a common Japanese encoding
with codecs.open("rshmn10j.txt", 'r', encoding=original_document_encoding) as in_f:
    unicode_content = in_f.read()

with codecs.open("rshmn10j.out.txt", 'w', encoding='utf-8') as out_f:
    out_f.write(unicode_content)

The with statement is used here so that each file is closed automatically when its block exits.
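On Python 3, the built-in open function accepts an encoding argument, so the same read-then-write pattern can be sketched without codecs. The helper name convert_to_utf8, the temporary file names, and the sample text below are assumptions made for illustration; they are not part of the original question or answer.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="shift_jis"):
    """Decode src_path using src_encoding and write the text to dst_path as UTF-8.

    Hypothetical helper for illustration; the default encoding is an assumption.
    """
    # Read and decode the original bytes into a Python str.
    # newline="" leaves the line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as in_f:
        text = in_f.read()
    # Write to a different path: opening a file in "w" mode truncates it at once.
    with open(dst_path, "w", encoding="utf-8", newline="") as out_f:
        out_f.write(text)
    return text


# Self-contained demo on a temporary Shift-JIS file (the sample text is made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_sjis.txt")
    dst = os.path.join(tmp, "sample_utf8.txt")
    with open(src, "wb") as f:
        f.write("羅生門\n".encode("shift_jis"))
    convert_to_utf8(src, dst)
    with open(dst, "rb") as f:
        utf8_bytes = f.read()
```

For the file in the question, a call along the lines of convert_to_utf8("rshmn10j.txt", "rshmn10j.out.txt") would mirror the codecs version above, provided the source really is Shift-JIS. If it is not, the read step is likely to raise UnicodeDecodeError rather than silently produce garbage.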
