Reputation: 165
Given the following bytestring, how can I remove any bytes shown as \x.. escapes (such as \x07 or \x00), and create a list object from what's left (by splitting on the removed areas)?
b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
Desired result:
["~", "pts/5", "/5", "user"]
The above string is just an example - I'd like to remove any \x.. (non-decoded) bytes.
I'm using Python 3.2.3, and would prefer to use standard libraries only.
Upvotes: 1
Views: 532
Reputation: 336348
>>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
>>> import re
>>> re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)  # br prefix: the rb spelling is a SyntaxError before Python 3.3
[b'~', b'pts/5', b'/5', b'user']
The results are still bytes objects. If you want the results to be strings:
>>> [i.decode("ascii") for i in re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)]
['~', 'pts/5', '/5', 'user']
Explanation:
[^\x00-\x1f\x7f-\xff]+ matches one or more (+) characters that are not in the range ([^...]) between ASCII 0 and 31 (\x00-\x1f) or between ASCII 127 and 255 (\x7f-\xff).
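Since the question asks about splitting on the removed areas, the same character class can also be used the other way round. The following is only a sketch of that idea, not part of the original answer: the names REMOVED and split_printable are made up for illustration, and it assumes the embedded texts are plain ASCII. It uses the br prefix, which Python 3.2 accepts as well.

```python
import re

# Bytes to drop: control characters (0-31) and everything from DEL (127) upward.
# This is the non-negated form of the character class explained above.
REMOVED = re.compile(br"[\x00-\x1f\x7f-\xff]+")

def split_printable(data):
    """Split data on runs of removed bytes and return the remaining pieces as str.

    Hypothetical helper; assumes the surviving bytes are ASCII text.
    """
    # re.split leaves empty chunks at the start and end when the data
    # begins or ends with removed bytes, so those are filtered out.
    return [chunk.decode("ascii") for chunk in REMOVED.split(data) if chunk]

sample = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
print(split_printable(sample))  # ['~', 'pts/5', '/5', 'user']
```

For this input, splitting and findall give the same list; findall is slightly shorter because it never produces empty chunks.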
Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, € etc.) from strings encoded in an 8-bit codepage like latin-1, and it will effectively destroy strings encoded in UTF-8 and other Unicode encodings, because those use byte values in the removed ranges as parts of their character codes: UTF-8 encodes every non-ASCII character with bytes between 128 and 255, and UTF-16 and UTF-32 pad ASCII characters with null bytes.
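To make the caveat concrete, here is a small illustration (the sample strings are made up, not taken from the question) of what the pattern does to non-ASCII input:

```python
import re

pattern = re.compile(br"[^\x00-\x1f\x7f-\xff]+")

# The same short texts in three encodings (illustrative inputs only).
latin1 = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8 = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'
utf16 = "pts".encode("utf-16-le")        # b'p\x00t\x00s\x00'

print(pattern.findall(latin1))  # [b'caf']  (the accented letter is dropped)
print(pattern.findall(utf8))    # [b'caf']  (both bytes of the letter are dropped)
print(pattern.findall(utf16))   # [b'p', b't', b's']  (null bytes split every character)
```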
Of course, you can always manually fine-tune the exact ranges you want to remove according to the example given in this answer.
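As one possible example of such fine-tuning, and purely as an assumption about the data: if the embedded texts were known to be latin-1, the upper range could be narrowed so that only the control bytes 127 to 159 are removed and the accented letters survive. The names LATIN1_TEXT and extract_latin1 are hypothetical.

```python
import re

# Assumption: the embedded text is latin-1. Only the control characters
# 0-31 and the range 127-159 are removed; bytes 160-255 are kept.
LATIN1_TEXT = re.compile(br"[^\x00-\x1f\x7f-\x9f]+")

def extract_latin1(data):
    """Return the runs of kept bytes, decoded as latin-1 (hypothetical helper)."""
    return [match.decode("latin-1") for match in LATIN1_TEXT.findall(data)]

data = b"\x00\x00caf\xe9\x00\x01user\x00"
print(extract_latin1(data))  # ['café', 'user']
```

This does not help with UTF-8 or UTF-16 data; for those, the bytes would have to be decoded first and the filtering done on the resulting str.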
Upvotes: 1