TiberSeptim
TiberSeptim

Reputation: 33

Strange behavior regular expressions

I'm writing a program to generate tokens from a source code in assembly but I have a strange problem.

Sometimes the code works as expected and sometimes not!

Here is the code (the variables are in Portuguese but I put a translation):

import re

def tokenize(code):
    tokens = []

    tokens_re = {
    'comentarios'  : '(//.*)',                         # comentary
    'linhas'       : '(\n)',                           # lines
    'instrucoes'   : '(add)',                          # instructions
    'numeros_hex'  : '([-+]?0x[0-9a-fA-F]+)',          # hex numbers
    'numeros_bin'  : '([-+]?0b[0-1]+)',                # binary numbers
    'numeros_dec'  : '([-+]?[0-9]+)'}                  # decimal numbers

    #'reg32'        : 'eax|ebx|ecx|edx|esp|ebp|eip|esi',
    #'reg16'        : 'ax|bx|cx|dx|sp|bp|ip|si',
    #'reg8'         : 'ah|al|bh|bl|ch|cl|dh|dl'}

    pattern = re.compile('|'.join(list(tokens_re.values())))
    scan = pattern.scanner(code)

    while 1:
        m = scan.search()
        if not m:
            break

        tipo = list(tokens_re.keys())[m.lastindex-1]     # type
        valor = repr(m.group(m.lastindex))               # value

        if tipo == 'linhas':
            print('')

        else:
            print(tipo, valor)

    return tokens



code = '''
add eax, 5 //haha
add ebx, -5
add eax, 1234
add ebx, 1234
add ax, 0b101
add bx, -0b101
add al, -0x5
add ah, 0x5
'''

print(tokenize(code))

And here the expected result:

instrucoes 'add'
numeros_dec '5'
comentarios '//haha'

instrucoes 'add'
numeros_dec '-5'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_bin '0b101'

instrucoes 'add'
numeros_bin '-0b101'

instrucoes 'add'
numeros_hex '-0x5'

instrucoes 'add'
numeros_hex '0x5'

The problem is that with no change in code, sometimes it gives the expected result but sometimes it's this:

instrucoes 'add'
numeros_dec '5'
comentarios '//haha'

instrucoes 'add'
numeros_dec '-5'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '0'
numeros_dec '101'

instrucoes 'add'
numeros_dec '-0'
numeros_dec '101'

instrucoes 'add'
numeros_dec '-0'
numeros_dec '5'

instrucoes 'add'
numeros_dec '0'
numeros_dec '5'

Where is the problem?

Upvotes: 3

Views: 51

Answers (1)

Stefan Pochmann
Stefan Pochmann

Reputation: 28606

You build your regex from a dictionary. Dictionaries aren't ordered, so the regex pattern can differ from time to time and thus produce different results.

If you want "stable" results, I suggest you either use sorted(tokens_re.values()) or specify them in a list/tuple rather than the dictionary.

For example you can specify them as list of pairs and then use that list to build the pattern as well as also build the dictionary:

tokens_re = [
    ('comentarios', '(//.*)'),                         # comentary
    ('linhas',      '(\n)'),                           # lines
    ('instrucoes',  '(add)'),                          # instructions
    ('numeros_hex', '([-+]?0x[0-9a-fA-F]+)'),          # hex numbers
    ('numeros_bin', '([-+]?0b[0-1]+)'),                # binary numbers
    ('numeros_dec', '([-+]?[0-9]+)'),                  # decimal numbers
]
pattern = re.compile('|'.join(p for _, p in tokens_re))
tokens_re = dict(tokens_re)

Upvotes: 5

Related Questions