RBV

Reputation: 1507

Confusion regarding UTF8 substring length

Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?

Using Python 3.5, I opened the XHTML file as UTF8 text:

inputTopicFile = open(inputFileName, "rt", encoding="utf8")

As shown in this hex-editor view, the first line of that UTF8-encoded XHTML file begins with the three-byte UTF8 BOM EF BB BF:

Hex-editor view of data in a UTF8 file

I wanted to remove the UTF8 BOM from what I supposed were the equivalent three initial character positions [0:3] in the string. So I tried this:

firstLine = firstLine[3:]

Didn't work -- the characters <? were no longer present at the start of the resulting line.

So I did this experiment:

for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))

Which printed:

charPos 0 == 
charPos 1 == <
charPos 2 == ?

I then added .encode to that loop as follows:

for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos].encode('utf8')))

Which gave me:

charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'

Evidently Python 3 in some way "knows" that the 3-byte BOM is a single unit of non-character data? Meaning that one cannot try to process the first three 8-bit bytes(?) in the line as if they were UTF8 characters?

At this point I know that I can "trick" my code into giving me what I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way(?)

So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?


EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:

inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")

That strips out the BOM. Voila!
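
For anyone who wants to see the difference side by side, here is a minimal sketch. It assumes a throwaway temporary file; the sample bytes, the .xhtml suffix, and the variable names are made up for illustration and are not from my real project:

```python
import os
import tempfile

# Hypothetical sample: a UTF-8 file whose first three bytes are the BOM EF BB BF.
sample_bytes = b"\xef\xbb\xbf<?xml version=\"1.0\"?>\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".xhtml") as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

try:
    # Plain "utf-8" keeps the BOM as the first character of the text (U+FEFF).
    with open(sample_path, "rt", encoding="utf-8") as f:
        line_plain = f.readline()

    # "utf-8-sig" drops a leading BOM if one is present.
    with open(sample_path, "rt", encoding="utf-8-sig") as f:
        line_sig = f.readline()
finally:
    os.remove(sample_path)

print(repr(line_plain[:3]))  # BOM character followed by "<?"
print(repr(line_sig[:2]))    # just "<?"
```

As far as I can tell, "utf-8-sig" also reads files that have no BOM, so it should be safe to use when the input may or may not start with one.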

Upvotes: 2

Views: 205

Answers (1)

Daniel H

Reputation: 7443

As you mentioned in your edit, you can open the file with the utf-8-sig encoding, but to answer your question of why it was behaving this way:

Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.

Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.
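
A short sketch of the distinction, in case it helps. The first_line value here is a hypothetical stand-in for the line you read from your file, and the variable names are only for illustration:

```python
import codecs

BOM = "\ufeff"  # the byte-order mark as a single Unicode code point

# Hypothetical first line, as it would look after decoding with plain "utf-8".
first_line = BOM + "<?xml version=\"1.0\"?>"

bom_char_count = len(BOM)        # one character in a str
bom_bytes = BOM.encode("utf-8")  # three bytes once encoded

# The standard library exposes the same three bytes as a constant.
assert bom_bytes == codecs.BOM_UTF8

# If the text is already decoded, strip the BOM by character, not by byte count.
if first_line.startswith(BOM):
    cleaned = first_line[len(BOM):]
else:
    cleaned = first_line

print(bom_char_count, len(bom_bytes), cleaned[:2])
```

This is why firstLine[1:] gave you the result you wanted while firstLine[3:] cut off the <? as well: slicing a character string counts code points, not bytes.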

Upvotes: 1
