David

Reputation: 11

Python regular expression: extract bytes from listing output

I'm trying to extract the binary opcodes from a listing file generated with the /Fa flag in Visual Studio. The format looks like this:

00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m

The first number is the address, followed by the opcode bytes and then the instruction mnemonic. For this example I'd like to get an array of:

8b 45 bc 3b 45 c8 73 19

First I split the line and then run the following regular expression to get bytes:

HEX_BYTE = re.compile("\s*[\da-fA-F]{2}\s*", re.IGNORECASE)

But this regex matches everything. Does anyone have an idea how to do this in a simple way? Thanks, David

Upvotes: 1

Views: 397

Answers (4)

Jan

Reputation: 43169

A Python example with the help of regular expressions:

import re
string = """00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m"""

# Each match is one run of two-digit hex tokens; strip the trailing whitespace.
hex_runs = [run.strip() for run in re.findall(r'((?:\b[\da-fA-F]{2}\b\s+)+)', string)]
print(hex_runs)
# ['8b 45 bc', '3b 45 c8', '73 19']
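
The pattern above looks for two-digit hex tokens anywhere in the text, so a two-digit operand such as `12` at the end of a line could be picked up as well. A minimal sketch of one way to tighten it, assuming every listing line starts with an optional indent and a hex address: anchor the match to the start of each line with `re.MULTILINE`. The name `LINE_RE` and the exact pattern are my own choices for illustration.

```python
import re

# Sample listing text copied from the question.
listing = """00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m"""

# Hypothetical pattern: optional indent, a hex address, then a run of
# two-digit hex bytes, each followed by whitespace. Only the run is captured.
LINE_RE = re.compile(
    r'^\s*[0-9a-fA-F]+\s+((?:[0-9a-fA-F]{2}\s+)+)',
    re.MULTILINE,
)

# Flatten the per-line runs into one list of byte strings.
opcodes = [byte for run in LINE_RE.findall(listing) for byte in run.split()]
print(opcodes)
```

For the three sample lines this gives the flat list asked for in the question. I have not tried it on a full listing file, so treat the pattern as a starting point.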

Upvotes: 0

mhawke

Reputation: 87114

Looking at the file sample in the question, it appears to consist of fixed-width fields, so you should be able to extract those values using fixed offsets into each line:

with open('listing.txt') as listing:
    opcodes = [opcode for line in listing for opcode in line[8:16].split()]

>>> opcodes
['8b', '45', 'bc', '3b', '45', 'c8', '73', '19']

The above uses a list comprehension to pluck out the required fields, which in the sample occupy columns 8 through 15 (the slice line[8:16]), using nothing but a slice operation and a split(). This ought to be a great deal faster than a regular expression, and it is a great deal more readable.

If you want the opcodes as integers:

with open('listing.txt') as listing:
    opcodes = [int(opcode, 16) for line in listing for opcode in line[8:16].split()]

>>> opcodes
[139, 69, 188, 59, 69, 200, 115, 25]
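
If what you ultimately want is the raw machine code rather than a list, the same slice can be fed to `bytes.fromhex()`, which ignores the spaces between the hex pairs. A minimal sketch, assuming Python 3 and the same column layout as above; `io.StringIO` only stands in for the listing file here so the example is self-contained.

```python
import io

# Stand-in for the listing file, using the sample lines from the question.
listing = io.StringIO(
    "00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]\n"
    "  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]\n"
    "  00046 73 19        jae     SHORT $LN1@unpacker_m\n"
)

# Assumes the opcode field sits in columns 8-15, as in the sample.
# bytes.fromhex() skips the whitespace between and after the hex pairs.
machine_code = b"".join(bytes.fromhex(line[8:16]) for line in listing)
print(machine_code.hex())
```

Note that a fixed eight-column slice only has room for three bytes, so an instruction with a longer encoding would need a wider slice or a different approach.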

Upvotes: 0

msw

Reputation: 43507

Forget regular expressions; they are overcomplicated for extracting data from fixed fields. The statements

line = '  00043 3b 45 c8     cmp     eax,'
print(line[7:19].split())

yield

['3b', '45', 'c8']

You might need to call

line = line.expandtabs()

first if there are tab characters in the input strings.
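
To see why this matters, here is a small sketch with a made-up, tab-separated version of the same line (I am assuming the real file separates its fields with tabs, which the question does not confirm). A tab counts as a single character when slicing, so the columns only line up after `expandtabs()` has padded each tab out to the next multiple of eight.

```python
# Hypothetical tab-separated variant of one listing line.
raw = "  00043\t3b 45 c8\tcmp\teax,"

# Slicing the raw string: each tab is one character, so the slice
# runs into the mnemonic and picks up a stray 'cm'.
without_expand = raw[7:19].split()

# After expandtabs() (default tab size 8) the fields sit at fixed columns.
with_expand = raw.expandtabs()[7:19].split()

print(without_expand)
print(with_expand)
```

The first list contains the stray fragment `'cm'`, the second only the three opcode bytes.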

Upvotes: 3

JonnyTieM

Reputation: 177

You could try this one: \s[\da-fA-F]{2}\s[\da-fA-F]{2}(\s[\da-fA-F]{2})?

It would return three results for your example:

" 8b 45 bc"

" 3b 45 c8"

" 73 19"

Split each match on whitespace and you get the result you described.
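
One thing to watch for when using this from Python: `re.findall()` returns only the contents of a capturing group when the pattern has one, so with the parentheses as written you would get just the optional third byte. A minimal sketch that sidesteps this by making the group non-capturing and reading the whole match via `re.finditer()`; the name `PAIR_RE` is mine.

```python
import re

# Sample listing text copied from the question.
listing = """00040   8b 45 bc     mov     eax, DWORD PTR _i$2535[ebp]
  00043 3b 45 c8     cmp     eax, DWORD PTR _code_section_size$[ebp]
  00046 73 19        jae     SHORT $LN1@unpacker_m"""

# Same pattern as above, but with (?:...) so the whole match is kept.
PAIR_RE = re.compile(r'\s[\da-fA-F]{2}\s[\da-fA-F]{2}(?:\s[\da-fA-F]{2})?')

# group(0) is the full match, e.g. ' 8b 45 bc'; split() drops the leading space.
opcodes = [byte for m in PAIR_RE.finditer(listing) for byte in m.group(0).split()]
print(opcodes)
```

The pattern needs at least two bytes and stops after three, so instructions encoded in a single byte or in more than three bytes would not be captured correctly.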

Upvotes: 0
