therealjayvi

Reputation: 171

Python3 re.findall occurrences within bytes object, using concatenation of specific bytes object + regex pattern as search parameter

Not entirely sure if I've worded that correctly, but here's what I'm trying to do.

I have a file which I typically open in a GUI hex editor, make a few modifications to, then save and exit. I've been trying to figure out how to automate this process entirely with Python, but I can't seem to get my regex search pattern right. Hopefully somebody can take a moment to see why not?

import binascii, re
infile = "my_file.bin"
with open(infile, "rb") as f:
    data = binascii.b2a_hex(f.read()).upper()

for matches in list(data):
    match_list = []
    matches = re.findall(b'\x24' + b'\x([A-Z]).{3,10}', data)
    match_list.append(matches)

The problem I have is using a special sequence in place of a literal hex character, since there are many sequences in the original file that I search for manually in order to make the modifications. All sequences begin with '$' ('\x24' in hex), but they vary in length. Each has at least 3 characters after the '$', and I want to make sure I catch them all, which explains the {3,10}.

The end goal is to output the found sequences into a list for reference, and then build a dictionary pairing each sequence with the offset it was found at. I've looked through page after page of docs trying to find an understandable way to go about this, and I think it can be achieved with the re.groupdict function, though I'm at a loss at this point. Any advice/help is appreciated.

[EDIT] Just found a similar question here, though I still feel my situation is different in that my regex pattern uses a special sequence instead of a static search.

Upvotes: 0

Views: 3074

Answers (1)

Serge Ballesta

Reputation: 149075

You have no reason to convert anything into hex: Python's re module can search raw byte strings directly. But to get the offsets where the strings are found, you really should loop with search instead of using findall.

The code could become:

import re
infile = "my_file.bin"
with open(infile, "rb") as f:
    data = f.read()

matches = []                # initializes the list for the matches
curpos = 0                  # current search position (starts at beginning)
pattern = re.compile(br'\$[A-Z]{3,10}')   # the pattern to search
while True:
    m = pattern.search(data[curpos:])     # search next occurrence
    if m is None: break                   # no more could be found: exit loop
    matches.append((curpos + m.start(), m.group(0))) # append a pair (pos, string) to matches
    curpos += m.end()          # next search will start after the end of found string
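
For the offset/sequence dictionary the question asks for, a shorter variant may also work. This is only a sketch: the sample bytes below are made up to stand in for the contents of my_file.bin, and the name by_offset is hypothetical. It relies on re.finditer, which yields match objects with absolute offsets, so no manual position tracking is needed.

```python
import re

# Made-up sample bytes standing in for the contents of my_file.bin.
data = b"\x00\x01$ABC\x02\x03$HELLO\xff$XY$LONGERNAME\x00"

pattern = re.compile(rb"\$[A-Z]{3,10}")   # same pattern as above

# finditer yields one match object per non-overlapping match,
# and m.start() is already the offset within data.
matches = [(m.start(), m.group(0)) for m in pattern.finditer(data)]

# Key the dictionary by offset: offsets are unique, whereas the same
# sequence could appear more than once in the file.
by_offset = dict(matches)

for pos, seq in matches:
    print(f"{pos:#06x}: {seq!r}")
```

Note that "$XY" in the sample is skipped because it has fewer than 3 uppercase letters after the '$'.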

Upvotes: 2
