Reputation: 2698

Parsing a string for nested patterns

What would be the best way to do this.

The input string is

<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>

the expected output is

{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],

'The other system worked for about 1 month': [116],

'on it then it started doing the same thing as the first one':[137]

}

that seems like a recursive regexp search but I can't figure out how exactly.

I can think of a tedious recursive function as of now, but have a feeling that there should be a better way.

Upvotes: 3

Answers (6)

jfs

Reputation: 414189

from cStringIO   import StringIO
from collections import defaultdict
####from xml.etree   import cElementTree as etree
from lxml import etree

xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

d = defaultdict(list)
for event, elem in etree.iterparse(StringIO(xml)):
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

print(dict(d.items()))

Output:

{'on it then it started doing the same thing as the first one': [137], 
'The other system worked for about 1 month': [116], 
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}

Upvotes: 2

jfs

Reputation: 414189

Here's an unreliable inefficient recursive regexp solution:

import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    if tag is not None: yield tag, text
    for m in re_tag.finditer(text):
        for tag, text in iterparse(m.group('content'), m.group('tag')):
            yield tag, text

def strip_tags(content):
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)


txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))

print(d)

Output:

{'on it then it started doing the same thing as the first one': [137], 
 'The other system worked for about 1 month': [116], 
 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
 either camera now they are just sitting and collecting dust. ': [133, 135]}

Upvotes: 0

harms

Reputation: 8796

Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.

Take the following mini-language:

A certain number of "(" followed by the same number of ")", no matter what the number.

You could make a regular expression very easily to represent a super-language of this mini-language (where you don't enforce the equality of the number of starts parentheses and end parentheses). You could also make a regular expression very easilty to represent any finite sub-language (where you limit yourself to some max depth of nesting). But you can never represent this exact language in a regular expression.

So you'd have to use a grammar, yes.

Upvotes: 1

Daniel Naab

Reputation: 23066

Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.

However, note that XML element names can't start with a number as your example has them.

Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:

from xml.parsers.expat import ParserCreate

open_elements = {}
result_dict = {}

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data

if __name__ == '__main__':
    p = ParserCreate()

    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data

    p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)

    print result_dict

Upvotes: 4

Aaron Digulla

Reputation: 328594

Take an XML parser, make it generate a DOM (Document Object Model) and then build a recursive algorithm that traverses all the nodes, calls "text()" in each node (that should give you the text in the current node and all children) and puts that as a key in the dictionary.

Upvotes: 4

Gonzalo Quero

Reputation: 3323

I think a grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html

Upvotes: 1

Parsing a string for nested patterns

Answers (6)

Related Questions