Reputation: 40185
I am processing a few thousand xml files and have a few problem files.
In each case, they begin with leading non-ASCII bytes, such as C3 AF C2 BB C2 BF or EF BB BF.
In all cases, the file contains only ASCII characters (after the header bytes), so there is no risk of data loss in converting them to ASCII.
I am not allowed to change the contents of the files on disk, only use them as input to my script.
At its simplest, I would be happy to convert such files to ASCII (all input files are parsed, some changes are made, and the results are written to an output directory, where a second script will process them).
How would I code that? When I try:
with open(filePath, "rb") as file:
contentOfFile = file.read()
unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")
with open(filePath, 'wt') as file:
file.write(asciiData)
I get an error: must be str, not bytes.
I also tried
asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')
with the same result. How do I correct that?
Or is there any other way to convert the file?
Upvotes: 2
Views: 282
Reputation: 369334
...
asciiData = unicodeData.encode("ascii", "ignore")
asciiData is a bytes object because it has been encoded. You need to open the file in binary mode instead of text mode:
with open(filePath, 'wb') as file:  # <---
    file.write(asciiData)
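For reference, here is a minimal sketch of the whole step with that fix applied. outputPath is a hypothetical name for the destination file in the output directory mentioned in the question, since the source files must stay untouched:

# Read raw bytes, decode as UTF-8, then drop every non-ASCII character.
# This also strips the leading byte-order marks: EF BB BF decodes to
# U+FEFF, and C3 AF C2 BB C2 BF decodes to "ï»¿", all non-ASCII.
with open(filePath, "rb") as file:
    contentOfFile = file.read()

unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")   # bytes, not str

with open(outputPath, "wb") as file:   # binary mode matches the bytes object
    file.write(asciiData)

Alternatively, if the second script expects the file to be written in text mode, keep 'wt' and write a str instead, e.g. file.write(asciiData.decode("ascii")).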
Upvotes: 3