Reputation: 5737
I have a text string similar to the one below:
statistics:
    time-started: Tue Feb 5 15:33:35 2013
    time-sampled: Thu Feb 7 12:25:39 2013
    statistic:
        active: 0
        interactive: 0
    count: 0
    up:
        packets: 0
        bytes: 0
    down:
        packets: 0
        bytes: 0
I need to parse strings like the one above (the real strings are much larger and more deeply nested; this is only an example). I think the easiest way to extract individual elements would be to convert the string to XML and use xml.etree.ElementTree to select the element I am looking for.
So I would like to convert the string above into an XML string like the one below:
<statistics>
    <time-started>Tue Feb 5 15:33:35 2013</time-started>
    <time-sampled>Thu Feb 7 12:25:39 2013</time-sampled>
    <statistic>
        <active>0</active>
        <interactive>0</interactive>
    </statistic>
    <count>0</count>
    <up>
        <packets>0</packets>
        <bytes>0</bytes>
    </up>
    <down>
        <packets>0</packets>
        <bytes>0</bytes>
    </down>
</statistics>
As you can see, all of the information needed to convert the string into XML is already there. I don't want to reinvent the wheel if there is a simple way or a module that can do this.
Upvotes: 0
Views: 2403
Reputation: 12755
user2050283 is definitely right: this is YAML, and that makes parsing easy. Mainly for educational reasons, I tried to parse it myself. I'm looking forward to some feedback.
The structure of your data is hierarchical, tree-like. So let's define a tree in Python, as simple as possible (reference):
from collections import defaultdict
def tree(): return defaultdict(tree)
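To see why this one-liner is enough, here is a small sketch of how such a tree behaves: a missing key springs into existence as another nested tree on first access, so a deep path can be assigned in a single statement. The keys are borrowed from the sample data; the variable name t is only for illustration.

```python
from collections import defaultdict


def tree():
    # Every missing key yields another tree, so nesting is created on demand.
    return defaultdict(tree)


t = tree()
# No intermediate dicts have to be created by hand.
t["statistics"]["up"]["packets"] = "0"
t["statistics"]["count"] = "0"

print(sorted(t["statistics"].keys()))
print(t["statistics"]["up"]["packets"])
```

One thing to keep in mind: merely reading a missing key also creates it, so membership should be tested with `in` rather than by indexing.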
Next, let's use this tree in a parsing function. It iterates over the lines, looks at the indentation, keeps a record of it and of the current path (aka breadcrumbs), tries to split each line into a key and a value (if one exists), and fills our tree. Where appropriate, I extracted logical chunks into separate functions, which follow below. If an indentation doesn't match any previous indentation, it raises an error, basically like Python does for its source code.
def load_data(f):
    doc = tree()
    previous_indents = [""]
    path = [""]
    for line in map(lambda x: x.rstrip("\n"),
                    filter(is_valid_line, f)):
        line_wo_indent = line.lstrip(" ")
        indent = line[:(len(line) - len(line_wo_indent))]
        k, v = read_key_and_value(line_wo_indent)
        if len(indent) > len(previous_indents[-1]):
            previous_indents.append(indent)
            path.append(k)
        elif len(indent) == len(previous_indents[-1]):
            path[-1] = k
        else:  # indent is shorter
            try:
                while previous_indents[-1] != indent:
                    previous_indents.pop()
                    path.pop()
            except IndexError:
                raise IndentationError("Indent doesn't match any previous indent.")
            path[-1] = k
        if v is not None:
            set_leaf_value_from_path(doc, path, v)
    return doc
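The helper functions I created are set_leaf_value_from_path (stores a value at the end of a path in the tree), read_key_and_value (splits a line at the first colon) and is_valid_line (skips blank lines and # comments).
Here is the full script: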
from collections import defaultdict
import io
from pprint import pprint


def tree():
    # A dict whose missing keys are themselves trees.
    return defaultdict(tree)


def dicts(t):
    # Convert nested defaultdicts to plain dicts for readable printing.
    if isinstance(t, dict):
        return {k: dicts(t[k]) for k in t}
    else:
        return t


def load_data(f):
    doc = tree()
    previous_indents = [""]  # stack of indentation strings seen so far
    path = [""]              # breadcrumbs: keys from the root to the current line
    for line in map(lambda x: x.rstrip("\n"),
                    filter(is_valid_line, f)):
        line_wo_indent = line.lstrip(" ")
        indent = line[:(len(line) - len(line_wo_indent))]
        k, v = read_key_and_value(line_wo_indent)
        if len(indent) > len(previous_indents[-1]):
            # Deeper level: descend.
            previous_indents.append(indent)
            path.append(k)
        elif len(indent) == len(previous_indents[-1]):
            # Same level: replace the last breadcrumb.
            path[-1] = k
        else:  # indent is shorter
            try:
                while previous_indents[-1] != indent:
                    previous_indents.pop()
                    path.pop()
            except IndexError:
                raise IndentationError("Indent doesn't match any previous indent.")
            path[-1] = k
        if v is not None:
            set_leaf_value_from_path(doc, path, v)
    return doc


def set_leaf_value_from_path(tree_, path, value):
    # Walk down the tree along path and store value at the last key.
    if len(path) == 1:
        tree_[path[0]] = value
    else:
        set_leaf_value_from_path(tree_[path[0]], path[1:], value)


def read_key_and_value(line):
    # Split at the first colon only, so values such as timestamps may contain colons.
    pos_of_first_colon = line.index(":")
    k = line[:pos_of_first_colon].strip()
    v = line[pos_of_first_colon + 1:].strip()
    return k, (v if v else None)


def is_valid_line(line):
    # Skip blank lines and comment lines.
    if line.strip() == "":
        return False
    if line.lstrip().startswith("#"):
        return False
    return True


if __name__ == "__main__":
    document_str = """
statistics:
    time-started: Tue Feb 5 15:33:35 2013
    time-sampled: Thu Feb 7 12:25:39 2013
    statistic:
        active: 0
        interactive: 0
    count: 1
    up:
        packets: 2
        bytes: 2
    down:
        packets: 3
        bytes: 3
"""
    # io.StringIO is the Python 3 replacement for cStringIO.StringIO.
    f = io.StringIO(document_str)
    doc = load_data(f)
    pprint(dicts(doc))
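With the data loaded into nested dicts, picking out a single element no longer needs XML at all. Below is a minimal sketch, assuming a hypothetical helper get_by_path (my own name, not part of the script) and a hand-written dict shaped like the pprint output; values are strings because the parser stores them that way.

```python
def get_by_path(doc, path, default=None):
    """Follow a sequence of keys into nested dicts; return default if a key is missing."""
    node = doc
    for key in path:
        # Testing with "in" avoids creating keys when doc is a defaultdict tree.
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hand-written stand-in for part of the structure that dicts(doc) returns.
doc = {
    "statistics": {
        "count": "1",
        "up": {"packets": "2", "bytes": "2"},
        "down": {"packets": "3", "bytes": "3"},
    }
}

print(get_by_path(doc, ["statistics", "down", "packets"]))
print(get_by_path(doc, ["statistics", "sideways"], default="n/a"))
```

Returning a default instead of raising is just one possible design; letting a KeyError propagate may be preferable when a missing element indicates a malformed input.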
Known restrictions:
These are only the known restrictions. I'm sure other parts of YAML aren't supported either. But it seems to be enough for your data.
Upvotes: 0
Reputation: 570
You are basically trying to convert YAML to XML. You can use PyYAML to parse your input string into a Python dict and then use an XML generator to convert the dict to XML.
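A rough sketch of the second half of that idea, using only the standard library. It assumes the YAML has already been parsed into nested dicts (for example with PyYAML's yaml.safe_load, which is deliberately not imported here so the snippet runs without third-party packages). The function name dict_to_xml is my own, and the sketch handles only nested mappings with scalar leaves, not lists or attributes.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(tag, value):
    """Build an Element named tag: dict values become children, anything else becomes text."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            elem.append(dict_to_xml(child_tag, child_value))
    else:
        elem.text = str(value)
    return elem


# Hand-written stand-in for what a YAML loader would return for a shortened sample.
data = {
    "statistics": {
        "time-started": "Tue Feb 5 15:33:35 2013",
        "statistic": {"active": 0, "interactive": 0},
        "count": 0,
        "up": {"packets": 0, "bytes": 0},
    }
}

# The document has a single top-level key, which becomes the XML root.
root_tag, root_value = next(iter(data.items()))
root = dict_to_xml(root_tag, root_value)

xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)

# ElementTree path syntax can then select the element of interest.
print(root.find("up/packets").text)
```

Since the dict is already available at that point, the XML step is only needed if XML is the required output; for lookups alone, indexing the dict directly is simpler.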
Upvotes: 2