Reputation: 4170
The file has the following format:
Component_name - version - [email protected] - multi-line comment with new lines and other white space characters
\t ...continue multi-line comment
Component_name2 - version - [email protected] - possibly multi-line comment with new lines and other white space characters
Component_name - version - [email protected] - possibly multi-line comment with new lines and other white space characters 2
Component_name - version - [email protected] - possibly multi-line comment with new lines and other white space characters 2
and so on...
After parsing, the output should be grouped by component_name:
output = [
"component_name" -> ["version - [email protected] - comment 1", "version - [email protected] - comment 2", ...],
"component_name2" -> [...],
...
]
This is what I have so far to parse it:
import re
from itertools import groupby

reTemp = r"[\w\_\-]*( \- )(\d*\.?){3}( \- )[\w\d\_\-\.\@]*( \- )[\S ]*"
numData = 4
reFormat = re.compile(reTemp)
textFileLines = textFile.split("\n")
temp = [x.split(" - ", numData - 1) for x in textFileLines if re.search(reFormat, x)]
# remove all empty lists; groupby only merges adjacent items, so sort by name first
m = sorted(filter(None, temp), key=lambda y: y[0].strip())
group = groupby(m, lambda y: y[0].strip())
This works well for single-line comments but fails with multi-line comments. Also, I am not sure whether a regex is the right tool for this. Is there a better, more Pythonic way to do it?
EDIT: Continuation lines of a multi-line comment begin with \t on a new line (e.g. look at the first entry above).
Upvotes: 2
Views: 1444
Reputation: 9994
You might want to formalize the file format as a grammar and then use one of the many parsers / parser generators Python has to offer to interpret the file according to the grammar.
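Short of a full parser generator, the record format is small enough that its grammar can be written down as one regular expression with named groups and applied to the whole text at once. The following is only a sketch under assumptions: the names `RECORD` and `group_by_component` are made up for illustration, the field patterns are guessed from the sample lines in the question, and continuation lines are assumed to start with a tab.

```python
import re
from collections import defaultdict

# record := name " - " version " - " email " - " comment
# comment := rest of the line, plus any following lines that start with a tab
RECORD = re.compile(
    r"^(?P<name>[\w-]+) - "
    r"(?P<version>\d+(?:\.\d+)*) - "
    r"(?P<email>\S+@\S+) - "
    r"(?P<comment>.*(?:\n\t.*)*)",
    re.MULTILINE,
)


def group_by_component(text):
    """Return {component_name: ["version - email - comment", ...]}."""
    grouped = defaultdict(list)
    for m in RECORD.finditer(text):
        # Drop the leading tab (and indentation) of each continuation line.
        comment = re.sub(r"\n\t[ \t]*", "\n", m["comment"])
        grouped[m["name"]].append(f'{m["version"]} - {m["email"]} - {comment}')
    return dict(grouped)
```

If the format grows beyond what one pattern can express comfortably (nested fields, escaping rules), that is the point where a dedicated parsing library is likely to pay off.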
Upvotes: 1
Reputation: 106
I've had to deal with structured data files like this and ended up writing a state machine to parse the file. Something like this (rough pseudocode):
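records = []
record = None
for line in file:
    if line matches new_record_regex:
        if record is not None:
            records.append(record)    # previous record is complete
        record = {"name": field0, "version": field1, "author": field2, "comment": field3}
    elif record is not None:
        record["comment"] += line     # continuation of a multi-line comment
if record is not None:
    records.append(record)            # don't drop the last record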
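For reference, a runnable version of the same line-by-line idea might look like the sketch below. It is not the code I actually used: `HEADER` and `parse_records` are names invented for this example, the header pattern is a guess based on the sample lines in the question, and it assumes every continuation line starts with a tab.

```python
import re
from collections import defaultdict

# Assumed shape of a line that starts a new record:
# "name - x.y.z - email - comment"
HEADER = re.compile(
    r"^(?P<name>[\w-]+) - (?P<version>\d+(?:\.\d+)*) - "
    r"(?P<email>\S+@\S+) - (?P<comment>.*)$"
)


def parse_records(text):
    """Return {component_name: ["version - email - comment", ...]}."""
    records = []  # each entry is [name, version, email, comment]
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # A header line starts a new record.
            records.append([m["name"], m["version"], m["email"], m["comment"]])
        elif records and line.startswith("\t"):
            # A tab-indented line extends the comment of the current record.
            records[-1][3] += "\n" + line.strip()
        # Any other line (blank, malformed) is ignored in this sketch.

    grouped = defaultdict(list)
    for name, version, email, comment in records:
        grouped[name].append(f"{version} - {email} - {comment}")
    return dict(grouped)
```

Collecting into a dict of lists avoids the sorting that `itertools.groupby` needs, since `groupby` only merges adjacent items with the same key.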
Upvotes: 1