Mawg

Reputation: 40185

How to strip the leading Unicode characters from a file?

I am processing a few thousand xml files and have a few problem files.

In each case, they contain leading Unicode bytes, such as C3 AF C2 BB C2 BF or EF BB BF.

In all cases, the file contains only ASCII characters (after the header bytes), so there is no risk of data loss in converting them to ASCII.
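
For what it is worth, both of those sequences look like forms of the UTF-8 byte order mark: EF BB BF is the BOM itself, and C3 AF C2 BB C2 BF appears to be the BOM mis-decoded as Latin-1 and re-encoded as UTF-8. A quick check in the interpreter:

bom = b"\xef\xbb\xbf"
print(repr(bom.decode("utf-8")))                        # '\ufeff', the BOM code point
double = b"\xc3\xaf\xc2\xbb\xc2\xbf"
print(repr(double.decode("utf-8")))                     # 'ï»¿', the BOM bytes read as Latin-1
print(double.decode("utf-8").encode("latin-1") == bom)  # True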

I am not allowed to change the contents of the files on disk, only use them as input to my script.

At its simplest, I would be happy to convert such files to ASCII (all input files are parsed, some changes are made, and the results are written to an output directory, where a second script will process them).

How would I code that? When I try:

with open(filePath, "rb") as file:
    contentOfFile = file.read()

unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")

with open(filePath, 'wt')  as file:
    file.write(asciiData)

I get the error "must be str, not bytes".

I also tried

    asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')

with the same result. How do I correct that?

Or is there any other way to convert the file?

Upvotes: 2

Views: 282

Answers (1)

falsetru

Reputation: 369334

...
asciiData = unicodeData.encode("ascii", "ignore")

asciiData is a bytes object because it is the result of encoding. You need to use binary mode instead of text mode when opening the file:

with open(filePath, 'wb')  as file:  # <---
    file.write(asciiData)
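
Putting it together with the question's variable names, a minimal sketch might look like this (outputPath is a placeholder for wherever the converted copy should be written, since the original file must not be modified):

with open(filePath, "rb") as file:
    contentOfFile = file.read()

# decode as UTF-8, then drop anything non-ASCII (including the BOM characters)
asciiData = contentOfFile.decode("utf-8").encode("ascii", "ignore")

with open(outputPath, "wb") as file:  # binary mode, because asciiData is bytes
    file.write(asciiData)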

Upvotes: 3
