Msg

Reputation: 142

How to change encoding of characters from file

I have been reading quite a bit about encoding, and I'm still not sure I'm fully wrapping my head around it. I have a file encoded as ANSI with the word "Solluções" in it. I want to convert the file to UTF-8, but whenever I do it changes the characters.

Code:

with codecs.open(filename_in,'r') as input_file, codecs.open(filename_out,'w','utf-8') as output_file:
   output_file.write(input_file.read())

Result: "Solluções"

I imagine this is a stupid problem, but I am at an impasse at the moment. I tried to call encode('utf-8') on the individual data in the file prior to writing it to no avail, so I'm guessing that's not correct either... I appreciate any help, thank you!

Upvotes: 1

Views: 161

Answers (2)

Josh Durham

Reputation: 1662

This SO answer to a similar question specifies the input encoding of the file, like codecs.open(sourceFileName, "r", "your-source-encoding"). Without that argument, Python does not detect the original encoding for you, so it may not interpret the characters correctly.
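As a minimal sketch of that idea, the conversion could look like the following. The function name convert_to_utf8 and the default of "cp1252" are assumptions for illustration only; pass whatever encoding your file really uses.

```python
import codecs


def convert_to_utf8(filename_in, filename_out, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8.

    source_encoding is an assumption: it must match how the input
    file was actually saved, otherwise the characters will be wrong.
    """
    # Decode the input with its real encoding, encode the output as UTF-8.
    with codecs.open(filename_in, "r", source_encoding) as input_file, \
         codecs.open(filename_out, "w", "utf-8") as output_file:
        output_file.write(input_file.read())
```

Usage would be something like convert_to_utf8("in.txt", "out.txt"), or convert_to_utf8("in.txt", "out.txt", "latin-1") if the file turns out to be Latin 1.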

Warning about the encodings: Most people talking about ANSI refer to one of the Windows codepages; you may really have a file in CP (codepage) 1252, which is almost, but not quite, the same thing as ISO-8859-1 (Latin 1). If so, use cp1252 instead of latin-1 as your-source-encoding.
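To illustrate where the two encodings differ, here is a small sketch. The sample bytes are made up for the example: the accented letters decode the same way in both, but byte 0x80 is the euro sign in CP 1252 and an invisible control character in Latin 1.

```python
# Hypothetical sample: byte 0x80 followed by " Solluções" as single-byte text.
raw = b"\x80 Sollu\xe7\xf5es"

as_cp1252 = raw.decode("cp1252")    # 0x80 becomes the euro sign U+20AC
as_latin1 = raw.decode("latin-1")   # 0x80 becomes the control character U+0080

# The accented letters agree; only the first character differs.
same_tail = as_cp1252[1:] == as_latin1[1:]
same_head = as_cp1252[0] == as_latin1[0]
```

So if the file only contains accented Latin letters, either name will give the same result; the choice matters for characters such as the euro sign or curly quotes.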

Upvotes: 1

Joran Beasley

Reputation: 113940

You can try:

  from codecs import encode, decode
  # Open both files in binary mode so the decoding and encoding are done explicitly.
  with open(filename_in, "rb") as input_file, open(filename_out, "wb") as output_file:
       decoded_unicode = decode(input_file.read(), "cp1252")  # I'm guessing this is what you mean by "ANSI"
       utf8_bytes = encode(decoded_unicode, "utf8")
       output_file.write(utf8_bytes)

Upvotes: 1
