Scheherazade

Reputation: 13

How can I get and store the text between XML tags as a string with the Python SAX parser?

I have an XML file that looks something like this:

<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>

What I need is to somehow get "Text I want to use" and "More text I want to use", but not "Text I'm not interested in" in the form of a string that can later be used by some arbitrary function. I also need to get "attribute_i_need" in the form of a string. I haven't really used the sax parser before and I'm completely stuck. I was able to just print all of the text in the document using the following:

import xml.sax

class myHandler(xml.sax.ContentHandler):

    def characters(self, content):
        print (content)

parser = xml.sax.make_parser()
parser.setContentHandler(myHandler())
parser.parse(open("sample.xml", "r"))

This will basically give me the output:

Text I want to use
Text I'm not interested in
More text I want to use

But the problem is twofold. First of all, this includes text that I have no interest in. Second, all it does is print the text. I can't figure out how to print specific text only, or write code that will return the text as a string that I can assign to a variable and use later. And I don't even know how to start with extracting the attribute I'm interested in.

Does anyone know how to solve this problem? And I would prefer a solution that involves the sax parser, because I at least have a vague understanding of how it works.

Upvotes: 1

Views: 1526

Answers (1)

Pavel Anossov

Reputation: 62868

The idea is to start saving characters when a TAG2 or TAG4 element opens and to stop when any element ends. The start of an element is also the opportunity to inspect and save any attributes of interest.

import xml.sax

class myHandler(xml.sax.ContentHandler):
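    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.text = []             # chunks of character data from TAG2/TAG4
        self.keeping_text = False  # True while inside a wanted element
        self.attributes = []       # collected attribute1 values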

    def startElement(self, name, attrs):
        if name.lower() in ('tag2', 'tag4'):
            self.keeping_text = True

        try:
            # must attribute1 be on a tag2 or anywhere?
            attr = attrs.getValue('attribute1')
            self.attributes.append(attr)
        except KeyError:
            pass

    def endElement(self, name):
        self.keeping_text = False

    def characters(self, content):
        if self.keeping_text:
            self.text.append(content)

parser = xml.sax.make_parser()
handler = myHandler()
parser.setContentHandler(handler)
parser.parse("sample.xml")  # parse() also accepts a file name

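print(handler.text)
print(handler.attributes)

# Output under Python 2 (Python 3 drops the u prefix, and a parser may
# split the character data into different chunks):
# [u'\n', u'      Text I want to use', u'\n', u'   ',
#  u'\n', u'      More text I want to use', u'\n', u'   ']
# [u'attribute_i_need']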
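
The lists printed above hold raw chunks, whitespace included, rather than one string per element. Here is a minimal sketch of one way to get clean strings instead. It assumes attribute1 is only wanted from TAG2, and it parses an inline copy of the sample (with a closing </TAG1>) so that it runs on its own. The names TextCollector, WANTED and SAMPLE are made up for illustration.

```python
import xml.sax

# Inline copy of the sample document, closed with </TAG1> so it is well-formed.
SAMPLE = b"""<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>"""


class TextCollector(xml.sax.ContentHandler):
    # Hypothetical handler: gathers one stripped string per wanted element.
    WANTED = ("TAG2", "TAG4")

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.texts = []
        self.attributes = []
        self._chunks = None  # None means "not inside a wanted element"

    def startElement(self, name, attrs):
        if name in self.WANTED:
            self._chunks = []
        # Assumption: attribute1 only matters when it appears on TAG2.
        if name == "TAG2" and "attribute1" in attrs:
            self.attributes.append(attrs["attribute1"])

    def characters(self, content):
        # characters() may be called several times per element, so buffer.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name in self.WANTED and self._chunks is not None:
            self.texts.append("".join(self._chunks).strip())
            self._chunks = None


handler = TextCollector()
xml.sax.parseString(SAMPLE, handler)

text_i_want, more_text = handler.texts  # plain strings, usable anywhere
attribute = handler.attributes[0]
```

Joining the chunks in endElement matters because the parser is allowed to deliver an element's text in several characters() calls. For a file on disk, xml.sax.parse("sample.xml", handler) should work the same way.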

I think BeautifulSoup or even bare lxml would be easier.
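
For comparison, here is a rough sketch of the tree-based approach. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, whose lxml.etree module offers a similar interface. It assumes the document fits in memory and is well-formed (closing </TAG1> included). SAMPLE is an inline copy of the question's data, not a real file.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document, closed with </TAG1> so it is well-formed.
SAMPLE = """<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>"""

root = ET.fromstring(SAMPLE)   # root is the TAG1 element
tag2 = root.find("TAG2")       # first direct child named TAG2
tag4 = root.find("TAG4")

text_i_want = tag2.text.strip()
more_text = tag4.text.strip()
attribute = tag2.get("attribute1")
```

The trade-off is that the whole tree is built before anything can be read, whereas the SAX handler sees the data as a stream.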

Upvotes: 1
