Confusion regarding UTF8 substring length

Question

Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?

Using Python 3.5, I opened the XHTML file as UTF8 text:

inputTopicFile = open(inputFileName, "rt", encoding="utf8")

A hex editor shows that the first line of that UTF8-encoded XHTML file begins with the three-byte UTF8 BOM EF BB BF.

I wanted to remove the UTF8 BOM from what I supposed were the first three character positions (indices 0 through 2) of the string. So I tried this:

firstLine = firstLine[3:]

That didn't work: the first two real characters, <?, were also missing from the start of the resulting line.



So I did this experiment:

for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))


Which printed:

charPos 0 == 
charPos 1 == <
charPos 2 == ?


I then added .encode to that loop as follows:

for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos].encode('utf8')))


Which gave me:

charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'


Evidently Python 3 somehow "knows" that the three-byte BOM is a single unit of non-character data? Does that mean one cannot process the first three 8-bit bytes in the line as if they were three UTF8 characters?
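
Here is a minimal sketch of what that experiment appears to show, using only the standard library. Once the bytes are decoded as utf8, the three BOM bytes become the single character U+FEFF, so string indices count characters rather than bytes. The byte string raw is made up for illustration and stands in for the start of the file:

```python
import codecs

# The three BOM bytes (EF BB BF) followed by the start of an XML declaration.
raw = codecs.BOM_UTF8 + b"<?xml"

# Plain "utf8" decoding keeps the BOM, but as ONE code point: U+FEFF.
text = raw.decode("utf8")

print(len(raw))                 # 8 (bytes)
print(len(text))                # 6 (characters)
print(hex(ord(text[0])))        # 0xfeff
print(text[0].encode("utf8"))   # b'\xef\xbb\xbf'
```

So firstLine[3:] drops U+FEFF plus the next two characters, which matches the output above.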

At this point I know that I can "trick" my code into giving me what I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way.
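
If the line has already been decoded with plain utf8, a slightly safer variant of the [1:] trick might first check that the first character really is U+FEFF, so that a file without a BOM does not lose its first character. This is only a sketch, and strip_bom is a hypothetical helper name, not something from the standard library:

```python
def strip_bom(line):
    """Return line without a leading U+FEFF, if one is present."""
    # After utf8 decoding, the BOM is the single character U+FEFF.
    if line.startswith("\ufeff"):
        return line[1:]
    return line


print(repr(strip_bom("\ufeff<?xml")))  # '<?xml'
print(repr(strip_bom("<?xml")))        # '<?xml'
```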

So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?



EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file: 

inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")


That strips out the BOM. Voila!
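
A small self-contained check of that fix, as a sketch. The temporary file names and the helper read_first_line are made up for illustration. It writes one file with a BOM and one without, then compares what the two encodings return for the first line:

```python
import codecs
import os
import tempfile


def read_first_line(path, encoding):
    """Open path as text with the given encoding and return its first line."""
    with open(path, "rt", encoding=encoding) as f:
        return f.readline()


with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.xhtml")
    without_bom = os.path.join(tmp, "without_bom.xhtml")

    # Write raw bytes so we control exactly whether a BOM is present.
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + b'<?xml version="1.0"?>\n')
    with open(without_bom, "wb") as f:
        f.write(b'<?xml version="1.0"?>\n')

    plain = read_first_line(with_bom, "utf8")          # keeps U+FEFF
    stripped = read_first_line(with_bom, "utf-8-sig")  # BOM removed
    no_bom = read_first_line(without_bom, "utf-8-sig") # nothing to remove

print(repr(plain[0]))      # '\ufeff'
print(repr(stripped[0]))   # '<'
print(stripped == no_bom)  # True
```

The last comparison suggests that utf-8-sig is also safe to use when a file has no BOM, since it only skips the BOM if one is there.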
