dominik

Reputation: 5935

Strip first two bytes from a string in Python

I need to remove the Byte Order Mark from a string. I already have the code to find the BOM but now I need to remove it from the actual string.

For example, the BOM is feff and is 2 bytes long, so the first two bytes of the string should not appear in the final string. However, when I slice the string by the length of the BOM, too much is removed.

Code snippet:

print len(bom)
print as_hex(bom)
print string
print as_hex(string)
string = string[len(bom):]
print string
print as_hex(string)

Output:

2
feff
Organ
feff4f7267616e
rgan
7267616e

What I hope to get is:

2
feff
Organ
feff4f7267616e
Organ
4f7267616e

The as_hex() function just renders each character as hex ("".join('%02x' % ord(c) for c in bytes)).

Upvotes: 0

Views: 4913

Answers (2)

Mark Tolonen

Reputation: 177600

Use the right codec and the BOM will be handled for you. Decoding with utf-8-sig and utf16 will remove a leading BOM if present. Encoding with them will add the BOM. If you do not want a BOM then use utf-8, utf-16le or utf-16be.

You should typically decode to Unicode when reading text data into a program, and encode to bytes when writing to a file, console, socket, etc.

unicode_str = u'test'
utf8_w_bom = unicode_str.encode('utf-8-sig')
utf16_w_bom = unicode_str.encode('utf16')
utf8_wo_bom = unicode_str.encode('utf-8')
utf16_wo_bom = unicode_str.encode('utf-16le')
print repr(utf8_w_bom)
print repr(utf16_w_bom)
print repr(utf8_wo_bom)
print repr(utf16_wo_bom)
print repr(utf8_w_bom.decode('utf-8-sig'))
print repr(utf16_w_bom.decode('utf16'))
print repr(utf8_wo_bom.decode('utf-8-sig'))
print repr(utf16_wo_bom.decode('utf16'))

Output:

'\xef\xbb\xbftest'
'\xff\xfet\x00e\x00s\x00t\x00'
'test'
't\x00e\x00s\x00t\x00'
u'test'
u'test'
u'test'
u'test'

Note that when decoding, utf16 assumes the native byte order if there is no BOM.
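Applied to the question's data, this could look like the sketch below. It assumes the input is a byte string that starts with the fe ff mark followed by big-endian UTF-16 text; the variable names are made up for illustration. It is written to run on both Python 2 and Python 3.

```python
import codecs
import sys

# Assumed input: the question's BOM (fe ff) followed by big-endian UTF-16 text.
raw = codecs.BOM_UTF16_BE + u"Organ".encode("utf-16-be")

# utf-16 reads the BOM, takes the byte order from it, and drops it.
text = raw.decode("utf-16")

# utf-16-be fixes the byte order up front, so the BOM survives as U+FEFF.
kept = raw.decode("utf-16-be")

# With no BOM at all, utf-16 falls back to this machine's byte order.
no_bom = u"Organ".encode("utf-16-le")
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
fallback = no_bom.decode("utf-16")

print(repr(text))
print(repr(kept))
print(fallback == no_bom.decode(native))
```

If the byte order is known in advance and a BOM may still be present, decoding with utf-16 rather than utf-16-be avoids having to strip U+FEFF afterwards.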

Upvotes: 0

Weeble

Reputation: 17900

I think you have a unicode string object. (If you're using Python 3 you certainly do, since it's the only kind of string.) Your as_hex function isn't printing out "fe" for the first character and "ff" for the second. It's printing out "feff" for the first unicode character in the string. For example (Python 3):

>>> mystr = "\ufeffHello world."
>>> mystr[0]
'\ufeff'
>>> '%02x' % ord(mystr[0])
'feff'

You either need to remove just one unicode character, or to store your string in a bytes object instead and remove two bytes.
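Both options might be sketched roughly as follows. The helper names strip_bom_text and strip_bom_bytes are hypothetical, and the byte-string case assumes the data is UTF-16 big-endian, which is only a guess based on the feff value in the question.

```python
import codecs


def strip_bom_text(text):
    """Drop a leading U+FEFF from a text (unicode) string: one character."""
    if text.startswith(u"\ufeff"):
        return text[1:]
    return text


def strip_bom_bytes(raw, bom=codecs.BOM_UTF16_BE):
    """Drop a leading BOM from a byte string: len(bom) bytes."""
    if raw.startswith(bom):
        return raw[len(bom):]
    return raw


text = u"\ufeffOrgan"
raw = codecs.BOM_UTF16_BE + u"Organ".encode("utf-16-be")

print(repr(strip_bom_text(text)))
print(repr(strip_bom_bytes(raw)))
```

Checking with startswith before slicing means strings without a BOM pass through unchanged.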

(This doesn't explain why len(bom) is 2, and I can't tell without seeing more of your code. I'd guess that bom is a list or a bytes object, not a unicode string.)


My answer above assumes Python 3, but I've realised from your print statements that you're using Python 2. Based on that, I'd guess that bom is a byte string (str) while string is a unicode string. If you use print repr(x) instead of print x, you can tell the two apart: in Python 2, unicode strings are shown with a u prefix.
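A small reproduction of that guess is below. The values of bom and string are assumptions reconstructed from the question's output, not the asker's actual code. The exact repr text differs between Python 2 and Python 3, so no output is shown.

```python
# Hypothetical values, reconstructed from the question's output.
bom = b"\xfe\xff"          # byte string: two bytes
string = u"\ufeffOrgan"    # text string: the BOM is a single character

print(repr(bom))
print(repr(string))

# Slicing the text string by the BOM's byte length drops one character too many.
too_short = string[len(bom):]
print(repr(too_short))
```

Here len(bom) is 2, yet the mark takes up only one position in string, so the slice also removes the "O".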

Upvotes: 4
