theAlse

Reputation: 5737

Convert text-strings to XML

I have a text string similar to the one below:

statistics:
    time-started: Tue Feb  5 15:33:35 2013
    time-sampled: Thu Feb  7 12:25:39 2013
    statistic:
        active: 0
        interactive: 0
    count: 0
    up:
        packets: 0
        bytes: 0
    down:
        packets: 0
        bytes: 0

I need to parse strings like the one above. The real strings are much larger and more deeply nested; this is just an example. I think the easiest way to extract individual elements would be to convert the string to XML and use xml.etree.ElementTree to select the element I am looking for.

So I would like to convert the string above into an XML string like the one below:

<statistics>
    <time-started>Tue Feb  5 15:33:35 2013</time-started>
    <time-sampled>Thu Feb  7 12:25:39 2013</time-sampled>
    <statistic>
        <active>0</active>
        <interactive>0</interactive>
    </statistic>
    <count>0</count>
    <up>
        <packets>0</packets>
        <bytes>0</bytes>
    </up>
    <down>
        <packets>0</packets>
        <bytes>0</bytes>
    </down>
</statistics>

As you can see, the string contains all the information needed to convert it to XML. I don't want to reinvent the wheel if there is a simple way or a module that can do this.

Upvotes: 0

Views: 2403

Answers (2)

Thorsten Kranz

Reputation: 12755

user2050283 is definitely right: it is YAML, and that makes parsing easy. Mainly for educational reasons, I tried to parse it myself. I'm looking forward to some feedback.

The structure of your data is hierarchical and tree-like. So let's define a tree in Python, as simply as possible (reference):

from collections import defaultdict

def tree(): return defaultdict(tree)
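
To illustrate why this one-liner is enough, here is a minimal sketch (the keys are just sample values taken from the question's data): every missing key creates another tree, so nested levels can be assigned without creating the intermediate dicts first.

```python
from collections import defaultdict

def tree():
    # Each missing key yields another tree, so nested levels appear on demand.
    return defaultdict(tree)

doc = tree()
# No intermediate dicts have to be created before assigning a leaf.
doc["statistics"]["up"]["packets"] = "0"
doc["statistics"]["up"]["bytes"] = "0"

print(sorted(doc["statistics"]["up"]))  # ['bytes', 'packets']
```

Note that merely reading a missing key also creates it, which is why the full script converts the result back to plain dicts before printing.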

Next, let's use this tree in a parsing function. It iterates over the lines, looks at the indentation, keeps a record of it and of the current path (aka breadcrumbs), tries to split each line into a key and a value (if one exists), and fills our tree. Where appropriate, I extracted logical chunks into separate functions, which follow below. If an indentation doesn't match any previous indentation, it raises an error, basically like Python does for its own source code.

def load_data(f):
    doc = tree()
    previous_indents = [""]
    path = [""]

    for line in map(lambda x: x.rstrip("\n"), 
                    filter( is_valid_line, f)
                ):
        line_wo_indent = line.lstrip(" ")
        indent = line[:(len(line) - len(line_wo_indent))]

        k, v = read_key_and_value(line_wo_indent)

        if len(indent) > len(previous_indents[-1]):
            previous_indents.append(indent)
            path.append(k)

        elif len(indent) == len(previous_indents[-1]):    
            path[-1] = k

        else: # indent is shorter
            try:
                while previous_indents[-1] != indent:
                    previous_indents.pop()
                    path.pop()            
            except IndexError:
                raise IndentationError("Indent doesn't match any previous indent.")
            path[-1] = k

        if v is not None:
            set_leaf_value_from_path(doc, path, v)
    return doc

The helper functions I created are:

  • set_leaf_value_from_path: takes a tree, a path (list of keys) and a value. It uses recursion to descend into the tree and set the value of the leaf defined by path.
  • read_key_and_value: splits a line into key and value at the first ":"
  • is_valid_line: checks that a line is neither empty nor a comment (a line starting with a number sign)

Here is the full script:

from collections import defaultdict

def tree(): return defaultdict(tree)

def dicts(t): 
    if isinstance(t, dict):
        return {k: dicts(t[k]) for k in t}
    else:
        return t

def load_data(f):
    doc = tree()
    previous_indents = [""]
    path = [""]

    for line in map(lambda x: x.rstrip("\n"), 
                    filter( is_valid_line, f)
                ):
        line_wo_indent = line.lstrip(" ")
        indent = line[:(len(line) - len(line_wo_indent))]

        k, v = read_key_and_value(line_wo_indent)

        if len(indent) > len(previous_indents[-1]):
            previous_indents.append(indent)
            path.append(k)

        elif len(indent) == len(previous_indents[-1]):    
            path[-1] = k

        else: # indent is shorter
            try:
                while previous_indents[-1] != indent:
                    previous_indents.pop()
                    path.pop()            
            except IndexError:
                raise IndentationError("Indent doesn't match any previous indent.")
            path[-1] = k

        if v is not None:
            set_leaf_value_from_path(doc, path, v)
    return doc

def set_leaf_value_from_path(tree_, path, value):
    if len(path)==1:
        tree_[path[0]] = value
    else:
        set_leaf_value_from_path(tree_[path[0]], path[1:], value)

def read_key_and_value(line):
    # Split at the first colon only, so values such as "15:33:35" stay intact.
    pos_of_first_colon = line.index(":")
    k = line[:pos_of_first_colon].strip()
    v = line[pos_of_first_colon+1:].strip()
    return k, v if len(v) > 0 else None

def is_valid_line(line):
    if line.strip() == "":
        return False
    if line.lstrip().startswith("#"):
        return False
    return True


if __name__ == "__main__":
    import io  # Python 3; the Python 2 original used cStringIO

    document_str = """
statistics:
    time-started: Tue Feb  5 15:33:35 2013
    time-sampled: Thu Feb  7 12:25:39 2013
    statistic:
        active: 0
        interactive: 0
    count: 1
    up:
        packets: 2
        bytes: 2
    down:
        packets: 3
        bytes: 3
"""
    f = io.StringIO(document_str)
    doc = load_data(f)

    from pprint import pprint
    pprint(dicts(doc))

Known restrictions:

  • Only scalars are supported as values
  • Only string-scalars as values
  • Multi-line scalars are not supported
  • Comments are not implemented as in the definition, i.e., they may not start anywhere in a line; only lines starting with a number sign are treated as comments

These are only the known restrictions. I'm sure other parts of YAML aren't supported either. But it seems to be enough for your data.
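
As an illustration of the "only string-scalars" restriction: every leaf comes back as a string, so "2" is not the number 2. A minimal sketch of an optional post-processing step follows; coerce_scalars is a hypothetical helper, not part of the script above, and it only handles non-negative integers.

```python
def coerce_scalars(node):
    """Return a copy of a nested dict with digit-only strings turned into ints."""
    if isinstance(node, dict):
        return {key: coerce_scalars(value) for key, value in node.items()}
    # isdecimal() is True only for strings made up entirely of decimal digits.
    if isinstance(node, str) and node.isdecimal():
        return int(node)
    return node

# Hand-written sample in the shape the parser produces (all leaves are strings).
parsed = {
    "statistics": {
        "time-started": "Tue Feb  5 15:33:35 2013",
        "count": "1",
        "up": {"packets": "2", "bytes": "2"},
    }
}

typed = coerce_scalars(parsed)
print(typed["statistics"]["up"]["packets"] + 1)  # 3
```

Timestamps and other non-numeric values are left untouched as strings.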

Upvotes: 0

user2050283

Reputation: 570

You are basically trying to convert YAML to XML. You can use PyYAML to parse your input string into a Python dict and then use an XML generator to convert the dict to XML.
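
A minimal sketch of the second step, using only the standard library. The dict below is a hand-written stand-in for what PyYAML's yaml.safe_load is assumed to return for a shortened version of the input, and dict_to_element is a hypothetical helper name. It assumes that every key is a valid XML tag name and that values are either scalars or nested dicts.

```python
import xml.etree.ElementTree as ET

def dict_to_element(tag, value):
    """Build an Element named `tag`; nested dicts become child elements."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(dict_to_element(child_tag, child_value))
    else:
        element.text = str(value)
    return element

# Stand-in for the parsed input (assumption: one top-level key).
data = {"statistics": {"count": 0, "up": {"packets": 0, "bytes": 0}}}

(root_tag, root_value), = data.items()
root = dict_to_element(root_tag, root_value)

# Serialize to an XML string, or query the tree directly.
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
print(root.find("up/packets").text)
```

Since the elements are already an ElementTree structure, find can be used on root directly without serializing to a string and parsing it again.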

Upvotes: 2
