stackErr

Reputation: 4170

Parsing structured data from a file in Python

The file has the following format:

Component_name - version - [email protected] - multi-line comment with new lines and other white space characters
\t ...continue multi-line comment
Component_name2 - version - [email protected] - possibly multi-line comment with new lines and other white space characters
Component_name - version - [email protected] - possibly multi-line comment with new lines and other white space characters 2
Component_name - version - [email protected] - possibly multi-line comment with new lines and other white space characters 2
and so on...

After parsing, the output should be grouped by component_name:

output = [
     "component_name" -> ["version - [email protected] - comment 1", "version - [email protected] - comment 2", ...],
     "component_name2" -> [...],
     ...
]

This is what I have so far to parse it:

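import re
from itertools import groupby

reTemp = r"[\w\_\-]*( \- )(\d*\.?){3}( \- )[\w\d\_\-\.\@]*( \- )[\S ]*"
numData = 4
reFormat = re.compile(reTemp)

textFileLines = textFile.split("\n")  # textFile holds the whole file as one string
temp = [x.split(" - ", numData - 1) for x in textFileLines if reFormat.search(x)]
# remove all empty lists; groupby only merges adjacent items, so sort by name first
m = sorted(filter(None, temp), key=lambda y: y[0].strip())
group = groupby(m, lambda y: y[0].strip())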

This works well for single-line comments but fails with multi-line comments. Also, I am not sure whether regex is the right tool for this. Is there a better, more Pythonic way to do this?


Upvotes: 2

Views: 1444

Answers (2)

das-g

Reputation: 9994

You might want to formalize the file format as a grammar and then use one of the many parsers / parser generators Python has to offer to interpret the file according to the grammar.
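As a rough illustration of that idea, here is a minimal sketch that writes the grammar down as a comment and implements it with one multi-line regular expression from the standard library instead of a dedicated parser generator. The field patterns (a dotted numeric version, an author without whitespace), the function name `parse_components` and the sample data are assumptions for illustration, not something the question specifies.

```python
import re
from collections import defaultdict

# Assumed grammar:
#   file    := record*
#   record  := header comment
#   header  := name " - " version " - " author " - "   (at the start of a line)
#   comment := everything up to the next header or the end of the file
HEADER = r"^(?P<name>[\w-]+) - (?P<version>\d+(?:\.\d+)*) - (?P<author>\S+) - "
RECORD = re.compile(
    HEADER + r"(?P<comment>.*?)(?=^[\w-]+ - \d+(?:\.\d+)* - \S+ - |\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_components(text):
    """Return {component_name: ["version - author - comment", ...]}."""
    grouped = defaultdict(list)
    for m in RECORD.finditer(text):
        comment = m.group("comment").strip()
        entry = "{} - {} - {}".format(m.group("version"), m.group("author"), comment)
        grouped[m.group("name")].append(entry)
    return dict(grouped)


# Hypothetical sample data in the format described in the question.
sample = (
    "comp_a - 1.0.0 - alice@example.com - first line\n"
    "\tcontinued\n"
    "comp_b - 2.1 - bob@example.com - single\n"
    "comp_a - 1.0.1 - alice@example.com - another\n"
)
components = parse_components(sample)
```

Because the result is collected in a dictionary, records of the same component do not need to be adjacent in the file. The sketch will mis-split a comment if one of its continuation lines happens to look like a record header; a real grammar would need a rule to rule that out, for example requiring continuation lines to start with whitespace.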

Upvotes: 1

Jon Winsley

Reputation: 106

I've had to deal with structured data files like this and ended up writing a state machine to parse the file. Something like this (rough pseudocode):

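records = []
record = None
for line in file:
    if line matches new_record_regex:
        if record is not None:
            records.append(record)    # the previous record is complete
        record = {"name": field0, "version": field1, "author": field2, "comment": field3}
    elif record is not None:
        record["comment"] += line     # continuation of a multi-line comment
if record is not None:
    records.append(record)            # don't lose the last record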
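A runnable version of that state machine might look like the sketch below. The record-start pattern `NEW_RECORD`, the helper names `parse_records` and `group_by_component`, and the sample lines are my own assumptions based on the format in the question, so adjust them to the real data.

```python
import re
from collections import defaultdict

# Assumed pattern for the first line of a record:
# "name - version - author - start of comment"
NEW_RECORD = re.compile(r"^([\w-]+) - (\d+(?:\.\d+)*) - (\S+) - (.*)$")


def parse_records(lines):
    """Walk the lines once, starting a new record on every header line."""
    records = []
    record = None  # the record currently being built, if any
    for line in lines:
        line = line.rstrip("\n")
        match = NEW_RECORD.match(line)
        if match:
            if record is not None:
                records.append(record)  # previous record is complete
            name, version, author, comment = match.groups()
            record = {"name": name, "version": version,
                      "author": author, "comment": comment}
        elif record is not None:
            record["comment"] += "\n" + line  # continuation line
    if record is not None:
        records.append(record)  # flush the last record
    return records


def group_by_component(records):
    """Build {name: ["version - author - comment", ...]}."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["name"]].append(
            "{} - {} - {}".format(r["version"], r["author"], r["comment"]))
    return dict(grouped)


# Hypothetical sample lines, as they would come from iterating over a file.
sample_lines = [
    "comp_a - 1.0.0 - alice@example.com - first line\n",
    "\tcontinued\n",
    "comp_b - 2.1 - bob@example.com - single\n",
    "comp_a - 1.0.1 - alice@example.com - another\n",
]
records = parse_records(sample_lines)
grouped = group_by_component(records)
```

In practice you would pass an open file object instead of `sample_lines`, since iterating over a file yields its lines one at a time.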

Upvotes: 1
