Nate

Reputation: 105

Regex matching of a bytes pattern gives unusual results - '.' not equivalent to [\x00-\xff]

I have written a program to find all occurrences of a specific pattern of binary data (I work with it in hexadecimal) amidst other, random data. The pattern occurs one or more times in a file, at any location. Here is the code I am using for the search, where f has already been opened in read/write mode:

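import re

pattern = b'...'  # placeholder; the actual bytes pattern goes here
f.seek(0)
bytechain = f.read()  # whole file as a single bytes object
match_iter = re.compile(pattern).finditer(bytechain)
matches = [x.start() for x in match_iter]  # start offset of each match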

Here is an example of one of the strings I'm trying to match:

b'\xD4\x00\x00\x00\x3C\x13\x00\x00\x4D\x0D\x78\x0A\x5C\x00'

a.k.a.

b'\xD4\x00\x00\x00<\x13\x00\x00M\x0Dx\x0A\\\x00'

Some of these values change, so I have to use dots to represent them in the regex pattern.

I have noticed that this pattern does not work. It matches until the last 2 dots are added, and then it fails to match:

pattern = b'\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]'

But when the pattern is changed to this, it matches as expected:

pattern = b'\xD4[\x00]{3}[\x00-\xff]{2}[\x00]{2}M..[\x00-\xff][\x5a-\x7f]'

Basically, it would appear that the byte b'\x5C' is not matched by '.', but it is matched by '[\x00-\xff]'!

What gives? I had thought that these would be equivalent for this data. There's something I don't understand about how these patterns are compiling. Can someone more experienced with regex help me out? I am not a programmer by trade but understanding this would help me improve this program.

Thanks in advance.

Upvotes: 2

Views: 869

Answers (1)

Jean-François Fabre

Reputation: 140256

The same rule applies to bytes patterns as to str patterns: if you want '.' to match every character, including newline, you have to use re.DOTALL.

match_iter = re.compile(pattern, flags=re.DOTALL).finditer(bytechain)

Bad luck: the byte that your last dot has to match is \x0A, which is a newline. The \x5C byte after it is not the problem.
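To see this concretely, here is a small sketch that runs the failing pattern against the sample bytes from the question, once without and once with the flag. The variable names (data, plain, dotall) are mine and only for illustration.

```python
import re

# Sample record from the question; the third byte after b'M' is \x0A (newline).
data = b'\xD4\x00\x00\x00\x3C\x13\x00\x00\x4D\x0D\x78\x0A\x5C\x00'
pattern = b'\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]'

# Without DOTALL, '.' refuses to match the newline byte \x0A.
plain = re.compile(pattern).search(data)

# With DOTALL, '.' matches every byte value, newline included.
dotall = re.compile(pattern, flags=re.DOTALL).search(data)

print(plain)           # None
print(dotall.start())  # 0
```

Note that \x0D (carriage return) is matched by '.' either way; only \x0A is excluded by default.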

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).
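As the documentation excerpt says, the inline form (?s) does the same thing. A minimal sketch, again assuming the sample bytes from the question (the names inline and flagged are hypothetical):

```python
import re

data = b'\xD4\x00\x00\x00\x3C\x13\x00\x00\x4D\x0D\x78\x0A\x5C\x00'

# (?s) at the very start of the pattern switches on DOTALL for the whole pattern.
inline = re.compile(b'(?s)\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]')
# Equivalent: pass the flag to re.compile instead.
flagged = re.compile(b'\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]', re.DOTALL)

inline_starts = [m.start() for m in inline.finditer(data)]
flagged_starts = [m.start() for m in flagged.finditer(data)]
print(inline_starts, flagged_starts)  # [0] [0]
```

The inline form can be handy when the pattern is stored separately from the code that compiles it.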

Upvotes: 4
