Reputation: 41
I already have the code to iterate through all files in a deep file structure where all files are UTF-8 and need to be converted to cp1252, a.k.a. ANSI.
I need to achieve the same simple result as converting the file in any serious text editor... why would there be any losses? Yes, some characters are standardly replaced by different ones: Šš=Šš Čč=Èè Ťť=?? Žž=Žž Ěě=Ìì Řř=Øø Ďď=Ïï Ňň=Òò Ůů=Ùù
But since a simple string conversion like
>>> print("Šš Čč Ťť Žž Ěě Řř Ďď Ňň Ůů".encode("utf-8").decode("cp1252"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python310\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 8: character maps to <undefined>
... doesn't work, what are my chances? I've literally been through dozens of articles here and there throughout the whole day and could not find a working solution, or understand the hell of this code-page conversion PITA. I even found complete functions and converters, obviously written for Python 2, none of them working.
Also not working:
chcp 65001
Active code page: 65001
with open(fpath, mode="r", encoding="utf-8") as fd:
    content = fd.read()
with open(fpath, mode="w", encoding="cp1252") as fd:
    fd.write(content)
or
with open(fpath, mode="r", encoding="utf-8") as fd:
    decoded = fd.decode("utf-8")
    content = decoded.encode("cp1252")
Upvotes: 0
Views: 4152
Reputation: 177406
Your first example will never work: you can't encode a Unicode string with one scheme and then decode the resulting bytes with another. What you can do is decode a file or byte string using the encoding it was generated with, then re-encode it in another encoding. The target encoding has to support the same Unicode code points, however.
UTF-8 can encode every Unicode code point, while CP1252 covers fewer than 256 of them, so don't expect your files to contain the same information if you go this route.
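The round trip that does work goes bytes → str → bytes, always pairing each step with the encoding that actually matches the data. A minimal sketch, using sample characters from your string that happen to exist in CP1252:

```python
# Decode bytes with the encoding they were actually written in,
# then re-encode the resulting str in the target encoding.
utf8_bytes = "Šš Žž".encode("utf-8")   # bytes as stored in a UTF-8 file
text = utf8_bytes.decode("utf-8")      # back to a Unicode str
cp1252_bytes = text.encode("cp1252")   # works: Š, š, Ž, ž all exist in CP1252
print(cp1252_bytes)                    # b'\x8a\x9a \x8e\x9e'
```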
There is an errors parameter that can be used when decoding (reading) a file and encoding (writing) a file. Here's an example of the loss for the example string provided:
>>> s = "Šš Čč Ťť Žž Ěě Řř Ďď Ňň Ůů"
>>> s.encode('cp1252',errors='ignore').decode('cp1252')
'Šš Žž '
>>> s.encode('cp1252',errors='replace').decode('cp1252')
'Šš ?? ?? Žž ?? ?? ?? ?? ??'
There are non-lossy error handlers as well, but they use replacement schemes. See Error Handlers in the Python codecs documentation.
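For example, the backslashreplace and xmlcharrefreplace handlers keep all of the information, at the cost of substituting escape sequences for the unmappable characters:

```python
s = "Čč"  # neither character exists in CP1252
print(s.encode("cp1252", errors="backslashreplace"))   # b'\\u010c\\u010d'
print(s.encode("cp1252", errors="xmlcharrefreplace"))  # b'&#268;&#269;'
```

Which one to pick depends on where the output goes: xmlcharrefreplace is only sensible if the files are later consumed as HTML/XML.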
So the second example can work, with loss, by providing the errors parameter:
with open(fpath, mode="r", encoding="utf-8") as fd:
    content = fd.read()
with open(fpath, mode="w", encoding="cp1252", errors='replace') as fd:
    fd.write(content)
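Since you already iterate over a deep file structure, the whole conversion can be sketched like this; the root directory and the *.txt glob pattern are assumptions here, so adjust them to your own tree:

```python
from pathlib import Path

def convert_tree(root):
    """Rewrite matching files in place from UTF-8 to CP1252."""
    for fpath in Path(root).rglob("*.txt"):  # glob pattern is an assumption
        content = fpath.read_text(encoding="utf-8")
        # unmappable characters become '?' -- lossy, but never raises
        fpath.write_text(content, encoding="cp1252", errors="replace")
```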
Upvotes: 0