Reputation: 2698
What would be the best way to do this.
The input string is
<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>
the expected output is
{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],
'The other system worked for about 1 month': [116],
'on it then it started doing the same thing as the first one':[137]
}
that seems like a recursive regexp search but I can't figure out how exactly.
I can think of a tedious recursive function as of now, but have a feeling that there should be a better way.
Related question: Can regular expressions be used to match nested patterns?
Upvotes: 3
Views: 1094
Reputation: 414189
from cStringIO import StringIO
from collections import defaultdict
####from xml.etree import cElementTree as etree
from lxml import etree
xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"
d = defaultdict(list)
for event, elem in etree.iterparse(StringIO(xml)):
d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))
print(dict(d.items()))
Output:
{'on it then it started doing the same thing as the first one': [137],
'The other system worked for about 1 month': [116],
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}
Upvotes: 2
Reputation: 414189
Here's an unreliable inefficient recursive regexp solution:
import re
re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)
def iterparse(text, tag=None):
if tag is not None: yield tag, text
for m in re_tag.finditer(text):
for tag, text in iterparse(m.group('content'), m.group('tag')):
yield tag, text
def strip_tags(content):
nested = lambda m: re_tag.sub(nested, m.group('content'))
return re_tag.sub(nested, content)
txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
d.setdefault(strip_tags(text), []).append(int(tag[:-2]))
print(d)
Output:
{'on it then it started doing the same thing as the first one': [137],
'The other system worked for about 1 month': [116],
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}
Upvotes: 0
Reputation: 8796
Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.
Take the following mini-language:
A certain number of "(" followed by the same number of ")", no matter what the number.
You could make a regular expression very easily to represent a super-language of this mini-language (where you don't enforce the equality of the number of starts parentheses and end parentheses). You could also make a regular expression very easilty to represent any finite sub-language (where you limit yourself to some max depth of nesting). But you can never represent this exact language in a regular expression.
So you'd have to use a grammar, yes.
Upvotes: 1
Reputation: 23066
Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.
However, note that XML element names can't start with a number as your example has them.
Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:
from xml.parsers.expat import ParserCreate
open_elements = {}
result_dict = {}
def start_element(name, attrs):
open_elements[name] = True
def end_element(name):
del open_elements[name]
def char_data(data):
for element in open_elements:
cur = result_dict.setdefault(element, '')
result_dict[element] = cur + data
if __name__ == '__main__':
p = ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)
print result_dict
Upvotes: 4
Reputation: 328594
Take an XML parser, make it generate a DOM (Document Object Model) and then build a recursive algorithm that traverses all the nodes, calls "text()" in each node (that should give you the text in the current node and all children) and puts that as a key in the dictionary.
Upvotes: 4
Reputation: 3323
I think a grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html
Upvotes: 1