Reputation: 1507
Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?
Using Python 3.5, I opened the XHTML file as UTF8 text:
inputTopicFile = open(inputFileName, "rt", encoding="utf8")
Viewed in a hex editor, the first line of that UTF8-encoded XHTML file begins with the three-byte UTF8 BOM EF BB BF.
I wanted to remove the UTF8 BOM, which I supposed occupied the first three character positions (0 through 2) of the string. So I tried this:
firstLine = firstLine[3:]
That didn't work -- the characters <? were no longer present at the start of the resulting line.
So I did this experiment:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))
Which printed:
charPos 0 ==
charPos 1 == <
charPos 2 == ?
I then added .encode to that loop as follows:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos].encode('utf8')))
Which gave me:
charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'
Evidently Python 3 in some way "knows" that the 3-byte BOM is a single unit of non-character data? Meaning that one cannot process the first three 8-bit bytes in the line as if they were three UTF8 characters?
At this point I know that I can "trick" my code into giving me what I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way(?)
So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?
EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:
inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")
That strips out the BOM. Voila!
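For anyone who wants to check the difference themselves, here is a minimal sketch. The temporary file and its one-line content are made up for illustration; only the two encoding names come from the question above.

```python
import codecs
import os
import tempfile

# Hypothetical sample: an XHTML first line preceded by the UTF-8 BOM.
sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?>\n'

# Write the bytes to a throwaway file so we can reopen it in text mode.
with tempfile.NamedTemporaryFile(delete=False, suffix=".xhtml") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Plain "utf8" decodes the three BOM bytes into the single character U+FEFF.
    with open(path, "rt", encoding="utf8") as f:
        with_bom = f.readline()

    # "utf-8-sig" skips a leading BOM if there is one.
    with open(path, "rt", encoding="utf-8-sig") as f:
        without_bom = f.readline()
finally:
    os.remove(path)

print(repr(with_bom[:3]))     # '\ufeff<?'
print(repr(without_bom[:2]))  # '<?'
```

Opening with "utf-8-sig" should also be safe for files that have no BOM, since the codec only skips the signature when it is actually present.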
Upvotes: 2
Views: 205
Reputation: 7443
As you mentioned in your edit, you can open the file with the utf-8-sig encoding, but to answer your question of why it was behaving this way:
Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.
Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.
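To make the distinction concrete, here is a small sketch of the same BOM seen as a character string and as a byte string. The variable names and the short `<?xml` sample are just for illustration.

```python
import codecs

bom_char = "\ufeff"                    # one codepoint in a character string
bom_bytes = bom_char.encode("utf-8")   # the same thing as a byte string

print(len(bom_char))                   # 1
print(len(bom_bytes))                  # 3
print(bom_bytes == codecs.BOM_UTF8)    # True

# Illustrative data: BOM followed by the start of an XML declaration.
raw = codecs.BOM_UTF8 + b"<?xml"
text = raw.decode("utf-8")

# Skipping the BOM takes a 3-byte slice on bytes but a 1-character slice on str.
print(raw[3:])                         # b'<?xml'
print(text[1:])                        # <?xml

# Decoding with utf-8-sig removes the BOM, so no slicing is needed.
print(raw.decode("utf-8-sig"))         # <?xml
```

So firstLine[1:] in the question isn't really a trick: it removes exactly one character, which happens to encode to three bytes in UTF-8.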
Upvotes: 1