Reputation: 142
I have been reading quite a bit about encoding, and I'm still not sure I've fully wrapped my head around it. I have a file encoded as ANSI with the word "Solluções" in it. I want to convert the file to UTF-8, but whenever I do, it changes the characters.
Code:
with codecs.open(filename_in, 'r') as input_file, \
     codecs.open(filename_out, 'w', 'utf-8') as output_file:
    output_file.write(input_file.read())
Result: "Solluções"
I imagine this is a stupid problem, but I am at an impasse at the moment. I tried to call encode('utf-8') on the individual data in the file prior to writing it to no avail, so I'm guessing that's not correct either... I appreciate any help, thank you!
Upvotes: 1
Views: 161
Reputation: 1662
This SO answer to a similar question specifies the encoding of the input file explicitly, as in codecs.open(sourceFileName, "r", "your-source-encoding"). Python does not detect the original encoding for you, so without that argument the characters may not be interpreted correctly.
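As a minimal sketch of that idea, assuming the source file really is CP1252: the helper name convert_to_utf8, its source_encoding parameter, and the throwaway demo files are illustrative and not taken from the question.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def convert_to_utf8(filename_in, filename_out, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8 (hypothetical helper for illustration).

    source_encoding is an assumption: pass whatever the file really uses.
    """
    # Reading with an explicit encoding yields decoded text;
    # the writer then encodes that text as UTF-8.
    with codecs.open(filename_in, "r", source_encoding) as input_file, \
         codecs.open(filename_out, "w", "utf-8") as output_file:
        output_file.write(input_file.read())


# Demo on throwaway files: create a CP1252 file, then convert it.
tmp_dir = tempfile.mkdtemp()
path_in = os.path.join(tmp_dir, "ansi.txt")
path_out = os.path.join(tmp_dir, "utf8.txt")
with open(path_in, "wb") as f:
    f.write(u"Solluções".encode("cp1252"))

convert_to_utf8(path_in, path_out)

with open(path_out, "rb") as f:
    result = f.read()  # raw UTF-8 bytes of the converted file
```

If the decode step raises an error or the output still looks wrong, the guess about the source encoding is the first thing to revisit.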
A warning about the encodings: most people talking about ANSI are referring to one of the Windows code pages. You may really have a file in CP (code page) 1252, which is almost, but not quite, the same thing as ISO-8859-1 (Latin-1). If so, use cp1252 instead of latin-1 as your-source-encoding. (Python spells the codec name cp1252 or windows-1252; cp-1252 is not a recognized alias.)
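To illustrate where the two encodings differ, here is a small sketch. The sample byte and variable names are chosen for illustration only. The bytes 0x80 to 0x9F are printable characters in CP1252 but control characters in Latin-1, while the accented letters in the question's word are encoded identically in both.

```python
# -*- coding: utf-8 -*-
raw = b"\x80"  # a byte the two encodings interpret differently

as_cp1252 = raw.decode("cp1252")    # the euro sign, U+20AC
as_latin1 = raw.decode("latin-1")   # a C1 control character, U+0080

# The letters in the question's word map to the same bytes in both encodings.
word = u"Solluções"
same_bytes = word.encode("cp1252") == word.encode("latin-1")
```

So for this particular word either codec would give the same result, but for a file that came from Windows, cp1252 is the safer guess.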
Upvotes: 1
Reputation: 113940
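You can try:
from codecs import encode, decode

# Read the raw bytes, decode them from the source code page, then write UTF-8 bytes.
with open(filename_in, "rb") as input_file, open(filename_out, "wb") as output_file:
    decoded_unicode = decode(input_file.read(), "cp1252")  # I'm guessing this is what you mean by "ANSI"
    utf8_bytes = encode(decoded_unicode, "utf8")
    output_file.write(utf8_bytes)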
Upvotes: 1