Reputation:
I have a question that includes various steps.
I am parsing a file that looks like this:
9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
I want to turn these lines of the file into the keys of a dictionary, ignoring the first number and taking only the second one, because the first just indicates the index of the key:
0 987
1 684321
2 098765
3 784658
And these lines should become the values, again ignoring only the first number, because it just indicates how many elements follow:
3 890 234 111
2 352 69
1 143
1 43
So at the end it has to look like this:
d = {987 : [890, 234, 111], 684321 : [352, 69],
098765 : [143], 784658 : [43]}
So far I have this:
findkeys = re.findall(r"\d\t(\d+)\n", line)
findelements = re.findall(r"\d\t(\d+)", line)
listss.append("".join(findelements))
d = {findkeys: listss}
The regular expressions need to be more restrictive. The one for the keys also returns numbers from other lines that I don't want as keys but that also contain just one number after the first. In the example file, the number 43 shows up as a result.
And the regular expression for the elements returns every line.
I don't know whether it would be easier to make the code ignore the lines I don't need information from, but I don't know how to do that.
I want to keep it as simple as possible. Thanks!
Upvotes: 0
Views: 79
Reputation: 42143
Once you have the lines in a list (the lines variable), you can use re to isolate the numbers and a dictionary/list comprehension to build the desired data structure.
Based on your example data, after the two header lines every 3rd line is a key, with its values on the following line. This means you only need to stride by 3 through the list, starting at index 2 for the keys and index 3 for the values.
findall() gives you the list of numbers (as text) on each line, and you can drop the first one with a simple subscript.
import re
value = re.compile(r"(\d+)")
numbers = [ [int(v) for v in value.findall(line)] for line in lines]
intDict = { key[1]:values[1:] for key,values in zip(numbers[2::3],numbers[3::3]) }
You could also do it using split(). Called with no arguments, it splits on any run of whitespace (spaces or tabs) and never returns empty strings, so no extra filtering is needed:
numbers = [ [int(v) for v in line.split()] for line in lines]
intDict = { key[1]:values[1:] for key,values in zip(numbers[2::3],numbers[3::3]) }
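The count at the start of each value line can also be checked instead of simply discarded. Below is a minimal, self-contained sketch of the same stride-by-3 idea with that check added. The function name build_dict and its parameters header_lines and group_size are illustrative and not part of the answer, and the sample text is copied from the question.

```python
import re

# Sample content copied from the question; in practice read it from your file.
sample = """9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
"""


def build_dict(lines, header_lines=2, group_size=3):
    """Build {key: values} from groups of lines (hypothetical helper).

    Assumes each group starts with a key line ("index key") followed by a
    value line ("count v1 v2 ..."); the remaining lines of a group are ignored.
    """
    number = re.compile(r"\d+")
    rows = [[int(v) for v in number.findall(line)] for line in lines]
    result = {}
    key_rows = rows[header_lines::group_size]
    value_rows = rows[header_lines + 1::group_size]
    for key_row, value_row in zip(key_rows, value_rows):
        count, values = value_row[0], value_row[1:]
        # The first number on a value line says how many values follow.
        if count != len(values):
            raise ValueError(f"expected {count} values, got {len(values)}")
        result[key_row[1]] = values
    return result


int_dict = build_dict(sample.splitlines())
print(int_dict)
```

Because the numbers are converted with int(), the key written as 098765 in the file ends up as 98765 here.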
Upvotes: 1
Reputation: 43169
You could build yourself a parser with e.g. parsimonious:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
data = """
9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
"""
grammar = Grammar(
r"""
data = (important / garbage)+
important = keyline newline valueline
garbage = ~".*" newline?
keyline = ws number ws number
valueline = (ws number)+
newline = ~"[\n\r]"
number = ~"\d+"
ws = ~"[ \t]+"
"""
)
tree = grammar.parse(data)
class DataVisitor(NodeVisitor):
    output = {}
    current = None
    def generic_visit(self, node, visited_children):
        return node.text or visited_children
    def visit_keyline(self, node, children):
        key = node.text.split()[-1]
        self.current = key
    def visit_valueline(self, node, children):
        values = node.text.split()
        self.output[self.current] = [int(x) for x in values[1:]]
dv = DataVisitor()
dv.visit(tree)
print(dv.output)
This yields
{'987': [890, 234, 111], '684321': [352, 69], '098765': [143], '784658': [43]}
The idea here is that every "keyline" consists of exactly two numbers, the second being the future dictionary key. The line that follows it is the valueline.
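Note that the keys in this output are strings, so '098765' keeps its leading zero. The short sketch below uses plain Python, independent of parsimonious, to show what changes if the keys are converted to int. The variable names are illustrative only.

```python
key_text = "098765"

# As a str key, the text is kept exactly as it appears in the file.
by_text = {key_text: [143]}

# As an int key, the leading zero is dropped.
by_int = {int(key_text): [143]}

# Written as a literal in Python 3 source, 098765 is not even valid syntax.
try:
    compile("098765", "<literal>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False

print(by_text)
print(by_int)
print(literal_ok)
```

Which key type is preferable depends on whether the leading zero carries meaning in your data; if it does, string keys are the safer choice.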
Upvotes: 0
Reputation: 168
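with open('filename.txt') as f:
    lines = f.readlines()
# remove the trailing newline and surrounding whitespace from each line
lines = [x.strip() for x in lines]
# skip the two header lines
lines = lines[2:]
# key lines are every 3rd line; the value line follows each key line
keys = lines[::3]
values = lines[1::3]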
output lines:
['0 987',
'3 890 234 111',
'1 0 1 90 1 34 1 09 1 67',
'1 684321',
'2 352 69',
'1 1 1 243 1 198 1 678 1 11',
'2 098765',
'1 143',
'1 2 1 23 1 63 1 978 1 379',
'3 784658',
'1 43',
'1 3 1 546 1 789 1 12 1 098']
output keys:
['0 987', '1 684321', '2 098765', '3 784658']
output values:
['3 890 234 111', '2 352 69', '1 143', '1 43']
Now you just have to put it together: iterate through keys and values in parallel, split each line, and drop the first number.
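One way that final step could look, as a sketch that starts from the keys and values lists shown above (pasted in literally so the snippet runs on its own). Converting to int is an assumption based on the desired result in the question.

```python
# The two lists produced by the slicing step above.
keys = ['0 987', '1 684321', '2 098765', '3 784658']
values = ['3 890 234 111', '2 352 69', '1 143', '1 43']

d = {}
for key_line, value_line in zip(keys, values):
    # Second number on the key line is the key; the first is just its index.
    key = int(key_line.split()[1])
    # Everything after the count on the value line is a value.
    d[key] = [int(v) for v in value_line.split()[1:]]

print(d)
```

If the leading zero in 098765 matters, keep the key as a string by dropping the int() call around key_line.split()[1].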
Upvotes: 1