Reputation: 5935
I need to remove the Byte Order Mark from a string. I already have the code to find the BOM but now I need to remove it from the actual string.
For example, the BOM is feff and has a length of 2 bytes, which means that the first two bytes of the string should not appear in the final string. However, when I slice the BOM off the Python string, too much is removed.
Code snippet:
print len(bom)
print as_hex(bom)
print string
print as_hex(string)
string = string[len(bom):]
print string
print as_hex(string)
Output:
2
feff
Organ
feff4f7267616e
rgan
7267616e
What I hope to get is:
2
feff
Organ
feff4f7267616e
Organ
4f7267616e
The as_hex() function just prints the characters as hex ("".join('%02x' % ord(c) for c in bytes)).
Upvotes: 0
Views: 4913
Reputation: 177600
Use the right codec and the BOM will be handled for you. Decoding with utf-8-sig and utf16 will remove a leading BOM if present. Encoding with them will add the BOM. If you do not want a BOM then use utf-8, utf-16le or utf-16be.
You should typically decode to Unicode when reading text data into a program, and encode to bytes when writing to a file, console, socket, etc.
unicode_str = u'test'
utf8_w_bom = unicode_str.encode('utf-8-sig')
utf16_w_bom = unicode_str.encode('utf16')
utf8_wo_bom = unicode_str.encode('utf-8')
utf16_wo_bom = unicode_str.encode('utf-16le')
print repr(utf8_w_bom)
print repr(utf16_w_bom)
print repr(utf8_wo_bom)
print repr(utf16_wo_bom)
print repr(utf8_w_bom.decode('utf-8-sig'))
print repr(utf16_w_bom.decode('utf16'))
print repr(utf8_wo_bom.decode('utf-8-sig'))
print repr(utf16_wo_bom.decode('utf16'))
Output:
'\xef\xbb\xbftest'
'\xff\xfet\x00e\x00s\x00t\x00'
'test'
't\x00e\x00s\x00t\x00'
u'test'
u'test'
u'test'
u'test'
Note that on decode, utf16 will assume the native byte order if there is no BOM.
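For anyone on Python 3, here is a rough sketch of the same idea applied to big-endian data like the feff BOM in the question. It assumes the input really is UTF-16 big-endian bytes; the variable names are only illustrative.

```python
import codecs

text = "test"

# Encoding with utf-8-sig prepends the UTF-8 BOM (EF BB BF).
utf8_with_bom = text.encode("utf-8-sig")

# Build big-endian UTF-16 bytes with an explicit FE FF BOM in front.
utf16_be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The BOM-aware codecs strip a leading BOM while decoding.
decoded_utf8 = utf8_with_bom.decode("utf-8-sig")
decoded_utf16 = utf16_be_with_bom.decode("utf-16")

# A fixed-endian codec does not strip it: the BOM survives as U+FEFF.
kept_bom = utf16_be_with_bom.decode("utf-16-be")

print(repr(decoded_utf8))
print(repr(decoded_utf16))
print(repr(kept_bom))
```

The last case is probably how a leading U+FEFF character ends up inside a decoded string in the first place.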
Upvotes: 0
Reputation: 17900
I think you have a unicode string object. (If you're using Python 3 you certainly do, since it's the only kind of string.) Your as_hex function isn't printing out "fe" for the first character and "ff" for the second. It's printing out "feff" for the first unicode character in the string. For example (Python 3):
>>> mystr = "\ufeffHello world."
>>> mystr[0]
'\ufeff'
>>> '%02x' % ord(mystr[0])
'feff'
You either need to remove just one unicode character, or to store your string in a bytes object instead and remove two bytes.
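A minimal Python 3 sketch of both options, assuming the data is the "Organ" example from the question. The as_hex helper here is my own reconstruction of the one described in the question, extended so it also accepts bytes.

```python
import codecs


def as_hex(data):
    """Hex-dump a str (one group per code point) or bytes (one group per byte)."""
    return "".join("%02x" % (c if isinstance(c, int) else ord(c)) for c in data)


# Option 1: a text string whose first character is U+FEFF.
text = "\ufeffOrgan"
# The BOM is ONE character here, so drop one character, not two.
text_no_bom = text[1:] if text.startswith("\ufeff") else text

# Option 2: a bytes object that starts with the two-byte UTF-16 BE BOM.
raw = codecs.BOM_UTF16_BE + "Organ".encode("utf-16-be")
bom = codecs.BOM_UTF16_BE
# The BOM is TWO bytes here, so slicing by len(bom) is correct.
raw_no_bom = raw[len(bom):] if raw.startswith(bom) else raw

print(as_hex(text))
print(as_hex(text_no_bom))
print(raw_no_bom.decode("utf-16-be"))
```

The key point is that the slice length has to be measured in the same units as the thing being sliced: characters for text, bytes for bytes.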
(This doesn't explain why len(bom) is 2, and I can't tell without seeing more of your code. I'd guess that bom is a list or a bytes object, not a unicode string.)
My answer above assumes Python 3, but I've realised from your print statements that you're using Python 2. Based on that, I'd guess that bom is an ASCII string while string is a unicode string. If you use print repr(x) instead of print x it will let you tell the difference between unicode and ASCII strings.
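To illustrate the repr check, here is a small sketch in Python 3 syntax, where the two types print as b'...' and '...' (in Python 2 they would show as '...' and u'...'). The values of bom and string are my guess at what the question's variables hold, not something taken from the original code.

```python
# Assumed values: a two-byte BOM and a text string starting with U+FEFF.
bom = b"\xfe\xff"
string = "\ufeffOrgan"

# repr makes the type difference visible: bytes get a b prefix, text does not.
print(repr(bom))
print(repr(string))

# The lengths are in different units: 2 bytes versus 6 characters.
print(len(bom), len(string))

# Slicing the text by the byte length removes the BOM character AND the "O".
too_short = string[len(bom):]
print(too_short)
```

That mismatch would reproduce the "rgan" result from the question.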
Upvotes: 4