Reputation: 11
I'm trying to extract the binary opcodes from a listing file generated with the /Fa flag in Visual Studio. The format looks like this:
00040 8b 45 bc mov eax, DWORD PTR _i$2535[ebp]
00043 3b 45 c8 cmp eax, DWORD PTR _code_section_size$[ebp]
00046 73 19 jae SHORT $LN1@unpacker_m
Here the first number is the address, followed by the opcode bytes and then the instruction mnemonic. In this case I'd like to get an array of:
8b 45 bc 3b 45 c8 73 19
First I split the line and then run the following regular expression on each piece to get the bytes:
HEX_BYTE = re.compile(r"\s*[\da-fA-F]{2}\s*", re.IGNORECASE)
But this regex matches everything. Does anyone have an idea how to do this in a simple way? Thanks, David
Upvotes: 1
Views: 397
Reputation: 43169
A Python example with the help of regular expressions:
import re
string = """00040 8b 45 bc mov eax, DWORD PTR _i$2535[ebp]
00043 3b 45 c8 cmp eax, DWORD PTR _code_section_size$[ebp]
00046 73 19 jae SHORT $LN1@unpacker_m"""
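# Each match is one run of space-separated two-digit hex bytes; strip the trailing whitespace.
opcodes = [group.strip() for group in re.findall(r'((?:\b[\da-fA-F]{2}\b\s+)+)', string)]
print(opcodes)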
# ['8b 45 bc', '3b 45 c8', '73 19']
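The result above is grouped per instruction, while the question asks for one flat array. A minimal sketch of the extra step, reusing the same pattern; the names `listing`, `groups` and `flat` are introduced here for illustration. One caveat to keep in mind: the pattern would also pick up any standalone two-character hex-looking operand followed by whitespace (for example an immediate `10` at the end of a line), so it is worth checking against the real listing.

```python
import re

listing = """00040 8b 45 bc mov eax, DWORD PTR _i$2535[ebp]
00043 3b 45 c8 cmp eax, DWORD PTR _code_section_size$[ebp]
00046 73 19 jae SHORT $LN1@unpacker_m"""

# One string of space-separated hex bytes per instruction line.
groups = re.findall(r'((?:\b[\da-fA-F]{2}\b\s+)+)', listing)

# Split each group and flatten everything into a single list of byte strings.
flat = [byte for group in groups for byte in group.split()]
print(flat)
```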
Upvotes: 0
Reputation: 87114
Looking at the file sample in the question it appears to consist of fixed width fields, so you should be able to extract those values using fixed offsets into each line:
with open('listing.txt') as listing:
    opcodes = [opcode for line in listing for opcode in line[8:16].split()]
>>> opcodes
['8b', '45', 'bc', '3b', '45', 'c8', '73', '19']
The above uses a list comprehension to pluck out the required fields, which sit at fixed offsets (the slice line[8:16] covers columns 8 through 15), using nothing but a slice operation and a split(). This should be faster than a regular expression, and it is easier to read.
If you want the opcodes as integers:
with open('listing.txt') as listing:
    opcodes = [int(opcode, 16) for line in listing for opcode in line[8:16].split()]
>>> opcodes
[139, 69, 188, 59, 69, 200, 115, 25]
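If the goal is the raw machine code rather than a list of numbers, the same slice can feed `bytes.fromhex`. This is only a sketch: it assumes every line carries its opcode bytes in columns 8 through 15, and the sample lines below are padded by hand to match that layout instead of being read from a real file. A longer encoding would fall outside the slice and be cut off.

```python
# Sample lines padded by hand so the opcode field sits in columns 8..15
# (an assumption about the layout; check it against the real listing).
lines = [
    "  00040 8b 45 bc  mov eax, DWORD PTR _i$2535[ebp]",
    "  00043 3b 45 c8  cmp eax, DWORD PTR _code_section_size$[ebp]",
    "  00046 73 19     jae SHORT $LN1@unpacker_m",
]

# Cut the fixed-width opcode field out of each line and join the pieces.
hex_text = " ".join(line[8:16].strip() for line in lines)

# bytes.fromhex ignores the spaces between the byte pairs.
machine_code = bytes.fromhex(hex_text)
print(machine_code.hex())
```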
Upvotes: 0
Reputation: 43507
Forget regular expressions; they are overcomplicated for extracting data from fixed-width fields. The statements
# Column padding restored so the opcode field ends before the mnemonic starts.
line = '  00043 3b 45 c8    cmp eax,'
print(line[7:19].split())
yield
['3b', '45', 'c8']
You might need to
line = line.expandtabs()
first if there are Tab characters in the input strings.
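To see why `expandtabs()` matters, here is a small sketch with a tab-separated line. The tab placement in the sample is an assumption made for illustration, not taken from a real listing. With the tabs left in place each one counts as a single character, so the slice reaches into the mnemonic.

```python
# Hypothetical line: fields separated by tab characters instead of spaces.
raw = "  00043\t3b 45 c8\t cmp\t eax,"

# Slicing the raw line: each tab is one character, so the columns are off.
with_tabs = raw[7:19].split()

# expandtabs() pads each tab with spaces up to the next multiple of 8.
expanded = raw.expandtabs()
without_tabs = expanded[7:19].split()

print(with_tabs)
print(without_tabs)
```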
Upvotes: 3
Reputation: 177
You could try this one: \s[\da-fA-F]{2}\s[\da-fA-F]{2}(\s[\da-fA-F]{2})?
It would return three results for your example:
" 8b 45 bc"
" 3b 45 c8"
" 73 19"
You would then have to split each result on whitespace to get the array you described.
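A sketch of how that pattern might be applied in Python. Two details are assumptions added here and not part of the answer: the optional group is made non-capturing, and `re.finditer` is used so that the whole match is available through `group(0)`. Note that the pattern only covers lines with two or three opcode bytes.

```python
import re

listing = """00040 8b 45 bc mov eax, DWORD PTR _i$2535[ebp]
00043 3b 45 c8 cmp eax, DWORD PTR _code_section_size$[ebp]
00046 73 19 jae SHORT $LN1@unpacker_m"""

# Two hex bytes, optionally followed by a third, each preceded by whitespace.
pattern = re.compile(r'\s[\da-fA-F]{2}\s[\da-fA-F]{2}(?:\s[\da-fA-F]{2})?')

opcodes = []
for match in pattern.finditer(listing):
    # group(0) is the full matched text, e.g. " 8b 45 bc"; split drops the spaces.
    opcodes.extend(match.group(0).split())

print(opcodes)
```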
Upvotes: 0