Reputation: 105
I have written a program to match all occurrences of a specific pattern of binary data (working in hexadecimal) amidst random other data. It occurs any number of times greater than 0 in a file, at any location. Here is the code I am using to do the search, where f has already been opened in read/write mode:
import re

pattern = ...  # pattern goes here (a bytes regex)
f.seek(0)
bytechain = f.read()
match_iter = re.compile(pattern).finditer(bytechain)
matches = [x.start() for x in match_iter]
Here is an example of one of the strings I'm trying to match:
b'\xD4\x00\x00\x00\x3C\x13\x00\x00\x4D\x0D\x78\x0A\x5C\x00'
a.k.a.
b'\xD4\x00\x00\x00<\x13\x00\x00M\x0Dx\x0A\\\x00'
Some of these values change, so I have to use dots to represent them in the regex pattern.
I have noticed that this pattern does not work. It matches until the last 2 dots are added, and then it fails to match:
pattern = b'\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]'
But when the pattern is changed to this, it matches as expected:
pattern = b'\xD4[\x00]{3}[\x00-\xff]{2}[\x00]{2}M..[\x00-\xff][\x5a-\x7f]'
Basically, it would appear that the byte b'\x5C' is not matched by '.', but it is matched by '[\x00-\xff]'!
What gives? I had thought that these would be equivalent for this data. There's something I don't understand about how these patterns are compiling. Can someone more experienced with regex help me out? I am not a programmer by trade but understanding this would help me improve this program.
Thanks in advance.
Upvotes: 2
Views: 869
Reputation: 140256
It's the same rule for bytes: you have to use re.DOTALL when using dots if you want to match all characters, including newline.
match_iter = re.compile(pattern,flags=re.DOTALL).finditer(bytechain)
Bad luck: the byte sitting where your last dot is happens to be \x0A, which is a newline.
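To see this concretely, here is a minimal sketch that uses the example bytes and the first pattern from the question. The variable names (data, pattern) are only for illustration, and it assumes the sample string is representative of the real file contents.

```python
import re

# Example bytes from the question; index 11 is the byte under the last dot.
data = b'\xD4\x00\x00\x00\x3C\x13\x00\x00\x4D\x0D\x78\x0A\x5C\x00'
pattern = b'\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]'

# The third dot after 'M' lines up with 0x0A, i.e. a newline.
print(data[11] == 0x0A)                      # True

# Without DOTALL, '.' refuses to match the newline byte, so nothing is found.
print(re.search(pattern, data))              # None

# With DOTALL, '.' matches every byte value, including 0x0A.
hit = re.search(pattern, data, flags=re.DOTALL)
print(hit.span())                            # (0, 13)
```

So \x5C is matched fine by [\x5a-\x7f]; it was the newline one position earlier that broke the match.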
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).
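If you would rather keep the flag inside the pattern itself, the inline form (?s) mentioned above should work for bytes patterns too. A small sketch, assuming a made-up buffer (junk bytes around two copies of the sample from the question) and collecting offsets the same way the question does:

```python
import re

# Sample bytes from the question.
sample = b'\xD4\x00\x00\x00\x3C\x13\x00\x00\x4D\x0D\x78\x0A\x5C\x00'

# Hypothetical file contents: arbitrary filler around two occurrences.
bytechain = b'\x01\x02\x03' + sample + b'\xff\x0a' + sample

# (?s) at the start of the pattern turns on DOTALL for the whole regex.
pattern = b'(?s)\xD4[\x00]{3}..[\x00]{2}M...[\x5a-\x7f]'

compiled = re.compile(pattern)
matches = [m.start() for m in compiled.finditer(bytechain)]
print(matches)                               # [3, 19]

# The compiled object reports the flag as set, same as passing re.DOTALL.
print(bool(compiled.flags & re.DOTALL))      # True
```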
Upvotes: 4