Reputation: 13
I have an XML file that looks something like this:
<TAG1>
    <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
        Text I want to use
    </TAG2>
    <TAG3>
        Text I'm not interested in
    </TAG3>
    <TAG4>
        More text I want to use
    </TAG4>
</TAG1>
What I need is to somehow get "Text I want to use" and "More text I want to use", but not "Text I'm not interested in" in the form of a string that can later be used by some arbitrary function. I also need to get "attribute_i_need" in the form of a string. I haven't really used the sax parser before and I'm completely stuck. I was able to just print all of the text in the document using the following:
import xml.sax

class myHandler(xml.sax.ContentHandler):
    def characters(self, content):
        print (content)

parser = xml.sax.make_parser()
parser.setContentHandler(myHandler())
parser.parse(open("sample.xml", "r"))
This will basically give me the output:
Text I want to use
Text I'm not interested in
More text I want to use
But the problem is twofold. First of all, this includes text that I have no interest in. Second, all it does is print the text. I can't figure out how to print specific text only, or write code that will return the text as a string that I can assign to a variable and use later. And I don't even know how to start with extracting the attribute I'm interested in.
Does anyone know how to solve this problem? And I would prefer a solution that involves the sax parser, because I at least have a vague understanding of how it works.
Upvotes: 1
Views: 1526
Reputation: 62868
The idea is to start saving all characters after encountering TAG2 or TAG4 and to stop whenever an element ends. An opening element is also an opportunity to inspect and save interesting attributes.
import xml.sax

class myHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.text = []             # character chunks seen inside TAG2/TAG4
        self.keeping_text = False  # True while inside TAG2 or TAG4
        self.attributes = []       # collected attribute1 values

    def startElement(self, name, attrs):
        if name.lower() in ('tag2', 'tag4'):
            self.keeping_text = True
        try:
            # must attribute1 be on a tag2 or anywhere?
            attr = attrs.getValue('attribute1')
            self.attributes.append(attr)
        except KeyError:
            pass

    def endElement(self, name):
        self.keeping_text = False

    def characters(self, content):
        if self.keeping_text:
            self.text.append(content)

parser = xml.sax.make_parser()
handler = myHandler()
parser.setContentHandler(handler)
parser.parse(open("sample.xml", "r"))
print(handler.text)
print(handler.attributes)
# Output under Python 2 (Python 3 omits the u prefixes):
# [u'\n', u' Text I want to use', u'\n', u' ',
# u'\n', u' More text I want to use', u'\n', u' ']
# [u'attribute_i_need']
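The handler above collects raw chunks, including the newlines and indentation around the text, so the lists still need cleaning before they are usable as strings. Here is a minimal sketch of one way to do that, assuming you want one stripped string per matching element and only care about attribute1 on TAG2/TAG4. The names CollectingHandler, extract and SAMPLE are made up for illustration, and the sample is parsed from an in-memory string with xml.sax.parseString rather than from sample.xml. It does not try to handle a wanted tag nested inside another wanted tag.

```python
import xml.sax

# Stand-in for sample.xml so the sketch is self-contained.
SAMPLE = b"""<TAG1>
    <TAG2 attribute1="attribute_i_need" attribute2="attribute_i_dont_need">
        Text I want to use
    </TAG2>
    <TAG3>
        Text I'm not interested in
    </TAG3>
    <TAG4>
        More text I want to use
    </TAG4>
</TAG1>"""


class CollectingHandler(xml.sax.ContentHandler):
    """Hypothetical variant: builds one cleaned string per wanted element."""

    WANTED = ("tag2", "tag4")

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.texts = []        # finished strings, one per wanted element
        self.attributes = []   # attribute1 values found on wanted elements
        self._chunks = None    # None means "not inside a wanted element"

    def startElement(self, name, attrs):
        if name.lower() in self.WANTED:
            self._chunks = []
            if "attribute1" in attrs:
                self.attributes.append(attrs["attribute1"])

    def characters(self, content):
        # The parser may deliver one element's text in several calls.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name.lower() in self.WANTED and self._chunks is not None:
            # Join the pieces and drop the surrounding whitespace.
            self.texts.append("".join(self._chunks).strip())
            self._chunks = None


def extract(xml_bytes):
    """Return (texts, attributes) as plain lists of strings."""
    handler = CollectingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts, handler.attributes


texts, attributes = extract(SAMPLE)
print(texts)
print(attributes)
```

Since extract returns ordinary lists, the values can be assigned to variables and passed to any other function, e.g. first_text = texts[0].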
I think BeautifulSoup or even bare lxml would be easier.
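To give a feel for the tree-based style, here is a rough sketch using the standard library's xml.etree.ElementTree, swapped in for lxml and BeautifulSoup so that nothing needs to be installed (lxml.etree offers a largely compatible API). It assumes the file is small enough to load whole and that TAG2 and TAG4 are direct children of the root; SAMPLE is a stand-in for the contents of sample.xml.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of sample.xml.
SAMPLE = """<TAG1>
    <TAG2 attribute1="attribute_i_need" attribute2="attribute_i_dont_need">
        Text I want to use
    </TAG2>
    <TAG3>
        Text I'm not interested in
    </TAG3>
    <TAG4>
        More text I want to use
    </TAG4>
</TAG1>"""

root = ET.fromstring(SAMPLE)

# find() looks among direct children; tag names are case-sensitive here.
tag2 = root.find("TAG2")
tag4 = root.find("TAG4")

text_i_want = tag2.text.strip()
more_text = tag4.text.strip()
attribute_i_need = tag2.get("attribute1")  # None if the attribute is absent

print(text_i_want)
print(more_text)
print(attribute_i_need)
```

The trade-off is that the whole document is held in memory, which SAX avoids, but for a file like the sample that is unlikely to matter.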
Upvotes: 1