Reputation: 868

Matching variable number of occurrences of token using regex in python

I am trying to match a token multiple times, but I only get back the last occurrence. I understand this is the normal behavior, as per this answer, but I haven't been able to get the solution presented there to work in my example.

My text looks something like this:

&{dict1_name}=   key1=key1value   key2=key2value
&{dict2_name}=   key1=key1value

So basically multiple lines, each with a starting string, spaces, then a variable number of key=value pairs. If you are wondering where this comes from, it is a Robot Framework variables file that I am trying to transform into a Python variables file.

I will be iterating per line to match the key pairs and construct a python dictionary from them.

My current regex pattern is:

&{([^ ]+)}=[ ]{2,}(?:[ ]{2,}([^\s=]+)=([^\s=]+))+

This correctly gets me the dict name but the key pairs only match the last occurrence, as mentioned above. How can I get it to return a tuple containing: ("dict1_name","key1","key1value"..."keyn","keynvalue") so that I can then iterate over this and construct the python dictionary like so:

dict1_name= {"key1": "key1value",..."keyn": "keynvalue"}

Thanks!

Upvotes: 2

Answers (4)

LoveToCode

Reputation: 868

Building off of Brad's answer, I made some modifications. As mentioned in my comment on his reply, it failed at empty lines or comment lines. I modified it to ignore these and continue. I also added handling of spaces: it now matches spaces in dictionary names but replaces them with underscores, since Python variable names cannot contain spaces. Keys are left untouched since they are strings.

import re


def robot_to_python(filename):
    """
    Convert a Robot Framework variables file containing dicts into a Python
    variables file of dicts that can be imported by both Python and Robot.
    """
    dname = re.compile(r"^&{(?P<name>.+)}=")
    keyval = re.compile(r"(?P<key>[\w|:]+)=(?P<val>[\w|:]+)")

    data = {}
    with open(filename + '.robot') as f:
        for line in f:
            n = dname.search(line)
            # Empty lines and comment lines do not match and are skipped.
            if n:
                # Spaces are not valid in Python identifiers.
                name = n.group("name").replace(" ", "_")
                data[name] = dict(keyval.findall(line))

    # The with block closes the file, so no explicit close() is needed.
    with open(filename + '.py', 'w') as file:
        for dict_name, keyvals in data.items():
            file.write(dict_name + " = { \n")
            for k in sorted(keyvals):
                file.write("'%s':'%s', \n" % (k, keyvals[k]))
            file.write("}\n\n")
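
As a quick check of the two patterns used above, here is a minimal sketch that runs them on a few in-memory lines instead of a file. The sample lines (a comment, an empty line, and a dict whose name contains a space) are made up for illustration and are not taken from a real Robot file.

```python
import re

# Same two patterns as in robot_to_python above.
dname = re.compile(r"^&{(?P<name>.+)}=")
keyval = re.compile(r"(?P<key>[\w|:]+)=(?P<val>[\w|:]+)")

# Hypothetical sample lines, invented for this sketch.
sample_lines = [
    "# a comment line",
    "",
    "&{my dict}=   host=10:20   mode=a|b",
]

data = {}
for line in sample_lines:
    m = dname.search(line)
    if m:  # comment and empty lines do not match, so they are skipped
        name = m.group("name").replace(" ", "_")
        data[name] = dict(keyval.findall(line))

print(data)
```

Note that inside a character class `|` is a literal pipe, so `[\w|:]` accepts word characters, pipes, and colons.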

Upvotes: 0

Brad Solomon

Reputation: 40888

As you point out, you need to work around the fact that a repeated capture group only keeps its last match. One way to do so is to take advantage of the fact that lines in a file are iterable, and to use two patterns: one for the "line name", and one for its multiple key-value pairs:*

import re

dname = re.compile(r'^&{(?P<name>\w+)}=')
keyval = re.compile(r'(?P<key>\w+)=(?P<val>\w+)')

data = {}
with open('input/keyvals.txt') as f:
    for line in f:
        name = dname.search(line)
        if name:
            name = name.group('name')
            data[name] = dict(keyval.findall(line))

*Admittedly, this is a tad inefficient since you're conducting two searches per line. But for moderately sized files, you should be fine.

Result:

>>> from pprint import pprint
>>> pprint(data)
{'d5': {'key1': '28f_s', 'key2': 'key2value'},
 'name1': {'key1': '5', 'key2': 'x'},
 'othername2': {'key1': 'key1value', 'key2': '7'}}

Note that in Python 3, \w matches Unicode word characters by default for str patterns, not just [a-zA-Z0-9_].
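
To illustrate the note on \w, here is a small sketch comparing the default behaviour with the `re.ASCII` flag. The sample line is invented for this example.

```python
import re

# Same key/value pattern as above, once with default (Unicode) matching
# and once restricted to ASCII word characters.
pair = re.compile(r"(?P<key>\w+)=(?P<val>\w+)")
pair_ascii = re.compile(r"(?P<key>\w+)=(?P<val>\w+)", re.ASCII)

# Hypothetical line with a non-ASCII key.
line = "größe=42   name=wert"

unicode_pairs = pair.findall(line)
ascii_pairs = pair_ascii.findall(line)

print(unicode_pairs)  # [('größe', '42'), ('name', 'wert')]
print(ascii_pairs)    # [('e', '42'), ('name', 'wert')]
```

With `re.ASCII` the key is cut down to the last ASCII run before the `=`, so which mode is appropriate depends on what the keys in your file may contain.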

Sample input, keyvals.txt:

&{name1}=   key1=5   key2=x
&{othername2}=   key1=key1value   key2=7
&{d5}=   key1=28f_s   key2=aaa key2=key2value
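
For reference, a minimal sketch of the "last match only" behaviour on the `d5` sample line, next to what `findall` returns. The single combined pattern here is a simplified stand-in for the one in the question, not the asker's exact expression.

```python
import re

line = "&{d5}=   key1=28f_s   key2=aaa key2=key2value"

# A repeated capturing group keeps only its final repetition.
repeated = re.compile(r"&{(\w+)}=(?:\s+(\w+)=(\w+))+")
last_only = repeated.match(line).groups()
print(last_only)  # ('d5', 'key2', 'key2value')

# findall returns every non-overlapping match as a (key, value) tuple.
pairs = re.findall(r"(\w+)=(\w+)", line)
print(pairs)  # [('key1', '28f_s'), ('key2', 'aaa'), ('key2', 'key2value')]

# dict() keeps the last value for a repeated key, which is why the
# result above shows key2 as 'key2value' rather than 'aaa'.
as_dict = dict(pairs)
print(as_dict)  # {'key1': '28f_s', 'key2': 'key2value'}
```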

Upvotes: 2

Jan

Reputation: 43169

Use two expressions in combination with a dict comprehension:

import re

junkystring = """
lorem ipsum
&{dict1_name}=   key1=key1value   key2=key2value
&{dict2_name}=   key1=key1value
lorem ipsum
"""

rx_outer = re.compile(r'^&{(?P<dict_name>[^{}]+)}(?P<values>.+)', re.M)
rx_inner = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')

result = {m_outer.group('dict_name'): {m_inner.group('key'): m_inner.group('value')
            for m_inner in rx_inner.finditer(m_outer.group('values'))}
            for m_outer in rx_outer.finditer(junkystring)}

print(result)

Which produces

{'dict1_name': {'key1': 'key1value', 'key2': 'key2value'}, 
 'dict2_name': {'key1': 'key1value'}}

With the two expressions being

^&{(?P<dict_name>[^{}]+)}(?P<values>.+)
# the outer format

See a demo on regex101.com. And the second

(?P<key>\w+)=(?P<value>\w+)
# the key/value pairs

See a demo for the latter on regex101.com as well.

The rest is simply nesting the two expressions in the dict comprehension.
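
If the nested comprehension is hard to read, the same logic can be sketched with explicit loops. The function name `parse_dicts` is made up for this example; the two patterns are the ones shown above.

```python
import re

rx_outer = re.compile(r'^&{(?P<dict_name>[^{}]+)}(?P<values>.+)', re.M)
rx_inner = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')


def parse_dicts(text):
    """Hypothetical helper: loop-based equivalent of the dict comprehension."""
    result = {}
    # Outer loop: one match per "&{name}..." line.
    for m_outer in rx_outer.finditer(text):
        inner = {}
        # Inner loop: every key=value pair in the rest of that line.
        for m_inner in rx_inner.finditer(m_outer.group('values')):
            inner[m_inner.group('key')] = m_inner.group('value')
        result[m_outer.group('dict_name')] = inner
    return result


text = """
lorem ipsum
&{dict1_name}=   key1=key1value   key2=key2value
&{dict2_name}=   key1=key1value
lorem ipsum
"""

parsed = parse_dicts(text)
print(parsed)
```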

Upvotes: 1

Dani Mesejo

Reputation: 61910

You could use two regexes, one for the names and another for the items, applying the items regex to the part of the line after the first space:

import re

lines = ['&{dict1_name}=   key1=key1value   key2=key2value',
         '&{dict2_name}=   key1=key1value']

# Raw strings avoid invalid-escape warnings for \{ and \w
name = re.compile(r'^&\{(\w+)\}=')
item = re.compile(r'(\w+)=(\w+)')

for line in lines:
    n = name.search(line).group(1)
    i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
    exec('{} = {}'.format(n, i))
    print(locals()[n])

Output

{'key2': 'key2value', 'key1': 'key1value'}
{'key1': 'key1value'}

Explanation

The pattern '^&\{(\w+)\}=' matches an '&' at the start of the line followed by a word (\w+) surrounded by curly braces '\{', '\}' and a trailing '='. The second regex matches any two words joined by a '='. The line:

i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))

creates a dictionary literal as a string. Finally, exec creates a dictionary with the required name, and you can access its value by querying locals().
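
If you would rather not create variables dynamically, one possible variation (a sketch, not part of the approach above) is to collect the parsed dictionaries in a plain dict keyed by name. The `namespace` variable is a name chosen for this example.

```python
import re

lines = ['&{dict1_name}=   key1=key1value   key2=key2value',
         '&{dict2_name}=   key1=key1value']

name = re.compile(r'^&\{(\w+)\}=')
item = re.compile(r'(\w+)=(\w+)')

# Hypothetical container: maps each dict name to its parsed contents,
# instead of creating a variable per name with exec.
namespace = {}
for line in lines:
    match = name.search(line)
    if match is None:
        continue  # skip lines that do not start with &{name}=
    rest = line[match.end():]  # everything after "&{name}="
    namespace[match.group(1)] = dict(item.findall(rest))

print(namespace['dict1_name'])
print(namespace['dict2_name'])
```

Lookups then go through `namespace[...]` rather than `locals()`, and a name from the file cannot shadow an existing variable.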

Upvotes: 1
