Strip first two bytes from a string in python

Question

I need to remove the Byte Order Mark from a string. I already have the code to find the BOM but now I need to remove it from the actual string.

For example, the BOM is feff and is 2 bytes long, which means that the first two bytes of the string should not appear in the final string. However, when I slice the string in Python, too much is removed.

Code snippet:

print len(bom)
print as_hex(bom)
print string
print as_hex(string)
string = string[len(bom):]
print string
print as_hex(string)

Output:

2
feff
Organ
feff4f7267616e
rgan
7267616e

What I hope to get is:

2
feff
Organ
feff4f7267616e
Organ
4f7267616e

The as_hex() function just prints the characters as hex ("".join('%02x' % ord(c) for c in bytes)).

Weeble · Accepted Answer

I think you have a unicode string object. (If you're using Python 3 you certainly do, since it's the only kind of string.) Your as_hex function isn't printing out "fe" for the first character and "ff" for the second. It's printing out "feff" for the first unicode character in the string. For example (Python 3):

>>> mystr = "\ufeffHello world."
>>> mystr[0]
'\ufeff'
>>> '%02x' % ord(mystr[0])
'feff'

You either need to remove just one unicode character, or to store your string in a bytes object instead and remove two bytes.
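Both options might look roughly like the following in Python 3. This is only a sketch: the variable names are made up for illustration, and the UTF-16 big-endian encoding is an assumption chosen so that the bytes match the hex dump in the question.

```python
import codecs

text = "\ufeffOrgan"  # str whose first character is the BOM (U+FEFF)

# Option 1: remove exactly one unicode character, if it is the BOM.
stripped_text = text[1:] if text.startswith("\ufeff") else text

# Option 2: work on bytes and remove the two BOM bytes.
# Assumption: the data is UTF-16 big-endian, so the BOM is b'\xfe\xff'.
raw = text.encode("utf-16-be")
if raw.startswith(codecs.BOM_UTF16_BE):
    raw = raw[len(codecs.BOM_UTF16_BE):]
decoded = raw.decode("utf-16-be")

print(stripped_text)  # Organ
print(decoded)        # Organ
```

Checking with startswith() before slicing avoids cutting off real data when the string has no BOM.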

(This doesn't explain why len(bom) is 2, and I can't tell without seeing more of your code. I'd guess that bom is a list or a bytes object, not a unicode string.)
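If that guess is right, the mismatch can be reproduced like this in Python 3. The names bom_bytes and text are hypothetical, and using codecs.BOM_UTF16_BE as the BOM is an assumption about what the original code does.

```python
import codecs

bom_bytes = codecs.BOM_UTF16_BE  # b'\xfe\xff', a bytes object of length 2
text = "\ufeffOrgan"             # unicode string: the BOM is ONE character

print(len(bom_bytes))  # 2
print(len("\ufeff"))   # 1

# Slicing the unicode string by the byte length drops two characters:
# the BOM and the "O" that follows it.
too_short = text[len(bom_bytes):]
print(too_short)       # rgan
```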

My answer above assumes Python 3, but I've realised from your print statements that you're using Python 2. Based on that, I'd guess that bom is a byte string (str) while string is a unicode string. If you use print repr(x) instead of print x, the output will let you tell the two apart: in Python 2 a unicode string is shown with a u prefix and a byte string without one.
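The same check works in Python 3, except that there the prefix marks the bytes object (b'...') rather than the unicode string. A small sketch, with illustrative values that are not taken from the original code:

```python
bom = b"\xfe\xff"      # bytes object (hypothetical value for bom)
text = "\ufeffOrgan"   # unicode string

print(repr(bom))   # b'\xfe\xff'
print(repr(text))  # '\ufeffOrgan'

# type() gives the same information without relying on the prefix.
print(type(bom).__name__, type(text).__name__)  # bytes str
```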
