Reputation:

Obtaining a dictionary out of regular expressions

I have a question that includes various steps.

I am parsing a file that looks like this:

 9
 123
 0  987
 3  890 234 111
 1 0    1 90    1 34    1 09    1 67    
 1  684321
 2  352 69
 1 1    1 243   1 198   1 678   1 11
 2  098765
 1  143
 1 2    1 23    1 63    1 978   1 379   
 3  784658
 1  43
 1 3    1 546   1 789   1 12    1 098

I want to make this lines in the file, keys of a dictionary (ignoring the first number and just taking the second one, because it just indicates which number of key should be):

And this lines, the values of the elements (ignoring only the first number too, because it just indicates how many elements are):

 3  890 234 111
 2  352 69
 1  143
 1  43

So at the end it has to look like this:

   d = {987 : [890, 234, 111], 684321 : [352, 69], 
         098765 : [143], 784658 : [43]}

So far I have this:

findkeys = re.findall(r"\d\t(\d+)\n", line)
findelements = re.findall(r"\d\t(\d+)", line)

listss.append("".join(findelements))
d = {findkeys: listss}

The regular expressions need more exceptions because the one for the keys, it gives me the elements of other lines that I don't want them to be keys, but have just one number too. Like in the example of the file, the number 43 appears as a result.

And the regular expression of the elements gives me back all the lines.

I don´t know if it will be easier to make that the code should ignore the lines of which I do not need information, but I don't know how to do that.

I want it to keep it has simple has possible. Thanks!

Upvotes: 0

Answers (3)

Alain T.

Reputation: 42139

Once you have the lines in a list (lines variable), you can simply use re to isolate numbers and dictionary/list comprehension to build the desired data structure.

Based on you example data, every 3rd line is a key with values on the following line. This means you only need to stride by 3 in the list.

findall() will give you the list of numbers (as text) on each line and you can ignore the first one with simple subscripts.

import re
value   = re.compile(r"(\d+)")
numbers = [ [int(v) for v in value.findall(line)] for line in lines]
intDict = { key[1]:values[1:] for key,values in zip(numbers[2::3],numbers[3::3]) }

You could also do it using split() but then you have to exclude empty entries that multiple spaces will create in the split:

numbers = [ [int(v) for v in line.split() if v != ""] for line in lines]
intDict = { key[1]:values[1:] for key,values in zip(numbers[2::3],numbers[3::3]) }

Upvotes: 1

Jan

Reputation: 43189

You could build yourself a parser with e.g. parsimonious:

from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar

data = """
 9
 123
 0  987
 3  890 234 111
 1 0    1 90    1 34    1 09    1 67    
 1  684321
 2  352 69
 1 1    1 243   1 198   1 678   1 11
 2  098765
 1  143
 1 2    1 23    1 63    1 978   1 379   
 3  784658
 1  43
 1 3    1 546   1 789   1 12    1 098   
"""
grammar = Grammar(
    r"""
    data        = (important / garbage)+
    important   = keyline newline valueline
    garbage     = ~".*" newline?
    keyline     = ws number ws number
    valueline   = (ws number)+
    newline     = ~"[\n\r]"
    number      = ~"\d+"
    ws          = ~"[ \t]+"
    """
)

tree = grammar.parse(data)

class DataVisitor(NodeVisitor):
    output = {}
    current = None

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_keyline(self, node, children):
        key = node.text.split()[-1]
        self.current = key

    def visit_valueline(self, node, children):
        values = node.text.split()
        self.output[self.current] = [int(x) for x in values[1:]]

dv = DataVisitor()
dv.visit(tree)
print(dv.output)

This yields

{'987': [890, 234, 111], '684321': [352, 69], '098765': [143], '784658': [43]}

The idea here is that every "keyline" is only composed of two numbers with the second being the soon-to-be keyword. The next line is the valueline.

Upvotes: 0

Kristóf Varga

Reputation: 168

with open('filename.txt') as f:
    lines = f.readlines()   
lines = [x.strip() for x in lines]
lines = lines[2:]
keys = lines[::3]
values = lines[1::3]

output lines:

['0  987',
 '3  890 234 111',
 '1 0    1 90    1 34    1 09    1 67',
 '1  684321',
 '2  352 69',
 '1 1    1 243   1 198   1 678   1 11',
 '2  098765',
 '1  143',
 '1 2    1 23    1 63    1 978   1 379',
 '3  784658',
 '1  43',
 '1 3    1 546   1 789   1 12    1 098']

output keys:

['0  987', '1  684321', '2  098765', '3  784658']

output values:

['3  890 234 111', '2  352 69', '1  143', '1  43']

Now you just have to put it together ! Iterate through keys and values.

Upvotes: 1

Obtaining a dictionary out of regular expressions

Answers (3)

Related Questions