Regex on bytestring in Python 3

Question

I am using RegEx to match BGP messages in a byte string. An example byte string looks like this:

b'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x13\x04\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x13\x04'

\xff repeated 8 times is used as a "magic marker" that starts a single message. Now I want to split the payload into messages so I can parse each of them.

messages = re.split(b'\xff{8}', payload)

Matching works fine, but I get some empty fields in my messages array:

b'' 
b''
b'001304'
b''
b''
b'001304'

Can someone explain this behavior? Why are there two empty fields between each correctly split message? In larger byte strings there is sometimes just one empty field between messages.

Wiktor Stribiżew · Accepted Answer

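I think you want to match 8 occurrences of the \xff byte. Wrapping it in a non-capturing group makes it explicit that the quantifier applies to the whole byte rather than to a trailing f (in a non-raw bytes literal \xff is already a single byte, so b'\xff{8}' matches the same thing):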

messages = re.split(b'(?:\xff){8}', payload)
                      ^^^    ^
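
To see where the empty fields come from, here is a minimal sketch using the payload from the question. It assumes nothing beyond the standard re module. The 16 marker bytes in front of each message are matched as two adjacent 8-byte matches, and re.split returns an empty bytes object before a match at the very start and between any two adjacent matches.

```python
import re

# Payload from the question: 16 marker bytes, then 3 message bytes, twice.
payload = b'\xff' * 16 + b'\x00\x13\x04' + b'\xff' * 16 + b'\x00\x13\x04'

# An 8-byte pattern matches each 16-byte run twice, back to back.
spans = [m.span() for m in re.finditer(b'(?:\xff){8}', payload)]
print(spans)

# Splitting emits b'' wherever nothing lies between two split points.
parts = re.split(b'(?:\xff){8}', payload)
print(parts)
```

The empty entries are therefore not a regex failure; they are the zero-length pieces between consecutive markers.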

Also, your string contains more than one run of 8 consecutive \xff bytes back to back. You might want to use

messages = re.split(b'(?:(?:\xff){8})+', payload)

However, that will still result in having an empty first element if the match is found at the start of the data. You may remove the part at the beginning before splitting:

messages = re.split(b'(?:(?:\xff){8})+', re.sub(b'^(?:(?:\xff){8})+', b'', payload))
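
A short sketch comparing the two variants above on the payload from the question (the variable names marker_run, collapsed and stripped are only illustrative):

```python
import re

payload = b'\xff' * 16 + b'\x00\x13\x04' + b'\xff' * 16 + b'\x00\x13\x04'

# One or more consecutive 8-byte markers, treated as a single separator.
marker_run = b'(?:(?:\xff){8})+'

# Collapsing runs removes the inner empty fields but keeps the leading one.
collapsed = re.split(marker_run, payload)
print(collapsed)

# Stripping the marker run at the start first removes the leading one too.
stripped = re.split(marker_run, re.sub(b'^' + marker_run, b'', payload))
print(stripped)
```

Note that a payload ending in a marker run would still produce a trailing empty element with either variant.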

However, the best idea is to just remove the empty elements with a list comprehension or with filter (kudos for testing goes to you):

messages = [x for x in re.split(b'(?:\xff){8}', payload) if x]
# Or, the fastest way here as per the comments
messages = list(filter(None, re.split(b'(?:\xff){8}', payload)))
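
Put together as a self-contained sketch: the helper name split_messages is hypothetical and the 8-byte marker length is taken from the question. It drops every empty element, whether it comes from a leading, adjacent, or trailing marker.

```python
import re

# 8 consecutive 0xff bytes, as described in the question.
MARKER = re.compile(b'(?:\xff){8}')


def split_messages(payload: bytes) -> list:
    """Split payload on the marker and discard empty pieces."""
    # filter(None, ...) keeps only truthy items, i.e. non-empty bytes objects.
    return list(filter(None, MARKER.split(payload)))


payload = b'\xff' * 16 + b'\x00\x13\x04' + b'\xff' * 16 + b'\x00\x13\x04'
messages = split_messages(payload)
print(messages)

# A payload that ends in a marker yields the same messages.
trailing = split_messages(payload + b'\xff' * 8)
print(trailing)
```

One caveat with any split-based approach: a message body that itself contains 8 consecutive \xff bytes would also be split at that point.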

See an updated Python 3 demo
