Reputation: 5252
I have a large source code file that is structured using spaces. I am only interested in one specific part of the structured text file, which looks like this:
SP : STRUCT
    Spare : STRUCT //Spare
        Val : INT := 100;
        UpLim : INT := 100;
        LoLim : INT ;
        Def : INT := 100;
        Prot : INT := 2;
    END_STRUCT ;
END_STRUCT ;
As you can see, there is an 'SP' structure defined (these will be dotted throughout the source code, but have the same name), which contains one or more other structures of the same type. In this example there is only one, called 'Spare.' Each structure will always contain the same 5 elements. If no value is defined it is zero.
What is the most elegant way of extracting each structure's name and its elements' values? Once extracted, they will be stored in a dictionary for quick and easy access.
I have tried using regex but I'm not so sure it's a very efficient solution to this particular problem. What approaches are normally taken for solving something like this?
Upvotes: 0
Views: 755
Reputation: 214949
This code appears to use Algol-like block keywords (STRUCT/END_STRUCT). I don't think indentation is syntactically significant here, so the parser should be keyword-based, for example:
import re

def parse(data):
    stack = [{}]
    for x in data.splitlines():
        x = re.sub(r'\s+', '', x)
        m = re.match(r'(\w+):STRUCT', x)
        if m:
            d = {}
            stack[-1][m.group(1)] = d
            stack.append(d)
            continue
        m = re.match(r'(\w+):INT(?::=(\w+))?', x)
        if m:
            stack[-1][m.group(1)] = int(m.group(2) or 0)
            continue
        m = re.match(r'END_STRUCT', x)
        if m:
            stack.pop()
            continue
    return stack[0]
Result:
data = """
SP : STRUCT
Spare : STRUCT //Spare
Val : INT := 100;
UpLim : INT := 100;
LoLim : INT ;
Def : INT := 100;
Prot : INT := 2;
END_STRUCT ;
END_STRUCT ;
"""
print(parse(data))
# {'SP': {'Spare': {'Val': 100, 'UpLim': 100, 'LoLim': 0, 'Def': 100, 'Prot': 2}}}
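The question mentions several SP blocks with the same name spread through the file, and a single nested dictionary keyed on 'SP' would keep only the last of them. Below is a minimal Python 3 sketch of the same keyword-based idea that merges the inner structures of every SP block into one flat dictionary. The function name `collect_sp`, the assumption that inner structure names are unique across the file, and the restriction to plain decimal initial values are all mine, not something the question confirms.

```python
import re

STRUCT_RE = re.compile(r'(\w+):STRUCT')
# Only plain decimal literals are recognised; anything else falls back to 0.
INT_RE = re.compile(r'(\w+):INT(?::=(-?\d+))?')


def collect_sp(text):
    """Return {struct_name: {element: value}} for structures directly inside any SP block."""
    result = {}
    path = []  # names of the currently open STRUCTs, outermost first
    for line in text.splitlines():
        # Drop the trailing comment, then all whitespace (indentation is not significant).
        line = re.sub(r'\s+', '', line.split('//')[0])
        m = STRUCT_RE.match(line)
        if m:
            path.append(m.group(1))
            if len(path) >= 2 and path[-2] == 'SP':
                result.setdefault(m.group(1), {})
            continue
        if line.startswith('END_STRUCT'):
            if path:
                path.pop()
            continue
        m = INT_RE.match(line)
        if m and len(path) >= 2 and path[-2] == 'SP':
            # A missing initial value is treated as zero, as stated in the question.
            result[path[-1]][m.group(1)] = int(m.group(2) or 0)
    return result


sample = """
SP : STRUCT
    Spare : STRUCT //Spare
        Val : INT := 100;
        LoLim : INT ;
        Prot : INT := 2;
    END_STRUCT ;
END_STRUCT ;
"""
structures = collect_sp(sample)
print(structures['Spare']['Prot'])
```

Tracking the path of open structures instead of a stack of dictionaries makes it easy to tell whether the current line sits directly inside an SP block, wherever that block appears in the file.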
Upvotes: 3
Reputation: 3375
If you only want to extract SP : STRUCT and are willing to parse it manually (carefully), you can use something like this:
data = {}
found = False
with open("code.txt", "r") as code:
    for line in code:  # iterate over the lines of the file
        # Drop the trailing comment and semicolon, then split on ":"
        clean = line.split("//")[0].strip().rstrip(";").split(":")
        fields = [f.strip() for f in clean]
        if found and fields[0].upper() == "END_STRUCT":
            break
        elif len(fields) == 2:
            if fields[0].upper() == "SP" and fields[1].upper() == "STRUCT":
                found = True
        elif len(fields) == 3 and found:
            if fields[1].upper() != "STRUCT":
                # "Val : INT := 100" splits into ["Val", "INT", "= 100"]
                data[fields[0]] = fields[2].lstrip("=").strip()
I used .upper() and checked len(fields) for safety, and .strip() mainly to ignore indentation (which does not seem to be significant: the code would be valid without it).
You could also add this piece of code (at the same indentation level as the last line) to store the values with the right type:
if fields[1].upper() == "INT":
    # Convert the string stored for this field into an integer
    data[fields[0]] = int(data[fields[0]])
#elif fields[1].upper() == "SOMETHING_ELSE":
#    data[fields[0]] = convert(data[fields[0]])
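Putting the pieces above together, here is a rough, regex-free sketch written as a single function that works on any iterable of lines. The name `parse_sp_fields`, taking lines rather than a file name, and defaulting a missing INT value to zero (as the question describes) are my assumptions. Like the loop above, it stops at the first END_STRUCT after SP, so it only reads the first inner structure.

```python
def parse_sp_fields(lines):
    """Collect the fields of the first structure inside the first SP block."""
    data = {}
    found = False
    for line in lines:
        # Remove the trailing comment and semicolon, then split on ":".
        clean = line.split("//")[0].strip().rstrip(";").split(":")
        fields = [f.strip() for f in clean]
        if found and fields[0].upper() == "END_STRUCT":
            break
        if len(fields) < 2:
            continue
        name, kind = fields[0], fields[1].upper()
        if kind == "STRUCT":
            if name.upper() == "SP":
                found = True
            continue
        if not found:
            continue
        # "Val : INT := 100" splits into ["Val", "INT", "= 100"];
        # "LoLim : INT" has no third field, so the value defaults to "0".
        raw = fields[2].lstrip("=").strip() if len(fields) > 2 else "0"
        data[name] = int(raw) if kind == "INT" else raw
    return data


sample_lines = [
    "SP : STRUCT",
    "Spare : STRUCT //Spare",
    "Val : INT := 100;",
    "LoLim : INT ;",
    "Prot : INT := 2;",
    "END_STRUCT ;",
    "END_STRUCT ;",
]
fields_found = parse_sp_fields(sample_lines)
print(fields_found)
```

With a real file, the same function can be called as `parse_sp_fields(open("code.txt"))`, since a file object yields its lines when iterated.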
Suggestion: try to avoid regex when parsing.
Upvotes: 1