Reputation: 33
I'm writing a program to generate tokens from assembly source code, but I have a strange problem.
Sometimes the code works as expected and sometimes it doesn't!
Here is the code (the variable names are in Portuguese, with translations in the comments):
import re

def tokenize(code):
    tokens = []
    tokens_re = {
        'comentarios' : '(//.*)', # comments
        'linhas' : '(\n)', # lines
        'instrucoes' : '(add)', # instructions
        'numeros_hex' : '([-+]?0x[0-9a-fA-F]+)', # hex numbers
        'numeros_bin' : '([-+]?0b[0-1]+)', # binary numbers
        'numeros_dec' : '([-+]?[0-9]+)'} # decimal numbers
        #'reg32' : 'eax|ebx|ecx|edx|esp|ebp|eip|esi',
        #'reg16' : 'ax|bx|cx|dx|sp|bp|ip|si',
        #'reg8' : 'ah|al|bh|bl|ch|cl|dh|dl'}
    pattern = re.compile('|'.join(list(tokens_re.values())))
    scan = pattern.scanner(code)
    while 1:
        m = scan.search()
        if not m:
            break
        tipo = list(tokens_re.keys())[m.lastindex-1] # type
        valor = repr(m.group(m.lastindex)) # value
        if tipo == 'linhas':
            print('')
        else:
            print(tipo, valor)
    return tokens

code = '''
add eax, 5 //haha
add ebx, -5
add eax, 1234
add ebx, 1234
add ax, 0b101
add bx, -0b101
add al, -0x5
add ah, 0x5
'''
print(tokenize(code))
And here is the expected result:
instrucoes 'add'
numeros_dec '5'
comentarios '//haha'
instrucoes 'add'
numeros_dec '-5'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_bin '0b101'
instrucoes 'add'
numeros_bin '-0b101'
instrucoes 'add'
numeros_hex '-0x5'
instrucoes 'add'
numeros_hex '0x5'
The problem is that, with no change to the code, it sometimes gives the expected result and sometimes gives this:
instrucoes 'add'
numeros_dec '5'
comentarios '//haha'
instrucoes 'add'
numeros_dec '-5'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_dec '0'
numeros_dec '101'
instrucoes 'add'
numeros_dec '-0'
numeros_dec '101'
instrucoes 'add'
numeros_dec '-0'
numeros_dec '5'
instrucoes 'add'
numeros_dec '0'
numeros_dec '5'
Where is the problem?
Upvotes: 3
Views: 51
Reputation: 28606
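You build your regex from a dictionary. In Python versions before 3.7, dictionaries aren't ordered, and because string hashes are randomized per process, the iteration order can change from one run to the next. The regex pattern can therefore differ between runs and produce different results. Whenever numeros_dec ends up before numeros_hex or numeros_bin in the alternation, the leading 0 of 0b101 or 0x5 is matched as a decimal number first, which is exactly the unexpected output you see.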
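To make the effect of the ordering concrete, here is a minimal sketch that joins the same two alternatives in both orders. The names dec_first, bin_first and matches are made up for this illustration and do not appear in the question's code.

```python
import re

# The same two alternatives, joined in different orders.
dec_first = re.compile(r'([-+]?[0-9]+)|([-+]?0b[0-1]+)')
bin_first = re.compile(r'([-+]?0b[0-1]+)|([-+]?[0-9]+)')

def matches(pattern, text):
    """Return the matched substrings in order of appearance."""
    return [m.group(0) for m in pattern.finditer(text)]

# Decimal alternative first: '-0' matches before the binary form is tried.
print(matches(dec_first, 'add bx, -0b101'))  # ['-0', '101']
# Binary alternative first: the whole literal is matched as one token.
print(matches(bin_first, 'add bx, -0b101'))  # ['-0b101']
```

At each position the regex engine tries the alternatives from left to right and takes the first one that matches, so the more specific prefixed forms have to come before the plain decimal pattern.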
If you want "stable" results, I suggest you either use sorted(tokens_re.values()) (the type lookup by index then has to follow the same sorted order)
or specify them in a list/tuple rather than the dictionary.
For example, you can specify them as a list of pairs, then use that list both to build the pattern and to build the dictionary:
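tokens_re = [
    ('comentarios', '(//.*)'), # comments
    ('linhas', '(\n)'), # lines
    ('instrucoes', '(add)'), # instructions
    ('numeros_hex', '([-+]?0x[0-9a-fA-F]+)'), # hex numbers
    ('numeros_bin', '([-+]?0b[0-1]+)'), # binary numbers
    ('numeros_dec', '([-+]?[0-9]+)'), # decimal numbers
]
# The pattern follows the list order, so hex and binary are always tried before decimal.
pattern = re.compile('|'.join(p for _, p in tokens_re))
# Note: on Python < 3.7 the dict's key order is still not guaranteed,
# so don't rely on list(tokens_re.keys())[m.lastindex-1] after this line.
tokens_re = dict(tokens_re)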
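As a variation on the list-of-pairs idea, the sketch below uses named groups, so the token type comes from m.lastgroup rather than from an index into the dictionary keys. This is not the code from the question: TOKEN_SPEC and TOKEN_RE are hypothetical names, and this tokenize returns a list of pairs instead of printing, which is an assumption about what you ultimately want.

```python
import re

# Order matters: the prefixed number forms must come before plain decimals.
TOKEN_SPEC = [
    ('comentarios', r'//.*'),                 # comments
    ('linhas', r'\n'),                        # lines
    ('instrucoes', r'add'),                   # instructions
    ('numeros_hex', r'[-+]?0x[0-9a-fA-F]+'),  # hex numbers
    ('numeros_bin', r'[-+]?0b[01]+'),         # binary numbers
    ('numeros_dec', r'[-+]?[0-9]+'),          # decimal numbers
]
# Wrap each pattern in a named group: (?P<name>pattern)
TOKEN_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(code):
    """Return (type, text) pairs; unmatched text such as register names is skipped."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(code)
            if m.lastgroup != 'linhas']

print(tokenize('add ax, 0b101 //haha\nadd al, -0x5\n'))
```

Because the type name travels with the match itself, the result no longer depends on any dictionary ordering, only on the order written in the list.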
Upvotes: 5