Reputation: 1444
I have pretty big XML documents, so I don't want to use DOM. While parsing a document with a SAX parser, I want to stop at some point (let's say when I reach an element with a certain name) and get everything inside that element as a string. "Everything" inside is not necessarily a text node; it may contain tags, but I don't want them to be parsed, I just want to get them as text.
I'm writing in Python. Is this possible? Thanks!
Upvotes: 3
Views: 2342
Reputation: 1552
It does not seem to be offered by the xml.sax API, but you can use another way of interrupting control flow: exceptions.
Just define a custom exception for that purpose:
class FinishedParsing(Exception):
    pass
Raise this exception in your handler when you have finished parsing, then catch and ignore it around the call to parse:
try:
    parser.parse(xml)
except FinishedParsing:
    pass
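To show how the two pieces fit together, here is a minimal sketch for Python 3. The handler class StopAtElementHandler, the helper parse_until, and the target element name are hypothetical names made up for illustration; only FinishedParsing comes from the answer above.

```python
import io
import xml.sax


class FinishedParsing(Exception):
    """Raised by the handler to abort parsing early."""


class StopAtElementHandler(xml.sax.ContentHandler):
    # Hypothetical handler: records element names and stops once
    # the element called `target` has been closed.
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)

    def endElement(self, name):
        if name == self.target:
            # Nothing after this point is of interest, so bail out.
            raise FinishedParsing()


def parse_until(xml_bytes, target):
    """Parse until `target` closes; return the element names seen so far."""
    handler = StopAtElementHandler(target)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    try:
        parser.parse(io.BytesIO(xml_bytes))
    except FinishedParsing:
        pass  # expected: the handler asked to stop
    return handler.seen
```

For example, parse_until(b"<root><a/><b/><c/></root>", "b") should return the names up to and including b, without ever visiting c. A parser that was stopped this way is best discarded rather than reused.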
Upvotes: 2
Reputation: 1401
Here is a hackish way to do this, using SAX. It collects the character data found inside the text element and drops any nested tags. It gets more complicated if you need to keep the tags and attributes inside that element, though.
from xml.sax import handler, make_parser

class CustomContentHandler(handler.ContentHandler):
    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.inside_text_tag = False
        self.text_content = []

    def startElement(self, name, attrs):
        if name == 'text':
            # Start collecting; reset the buffer for each new <text> element.
            self.inside_text_tag = True
            self.text_content = []

    def endElement(self, name):
        if name == 'text':
            self.inside_text_tag = False
            self.text = ''.join(self.text_content)
            print(self.text)

    def characters(self, content):
        # May be called several times per text node, so accumulate the chunks.
        if self.inside_text_tag:
            self.text_content.append(content)

def parse_file(filename):
    parser = make_parser()
    ch = CustomContentHandler()
    parser.setContentHandler(ch)
    # Open in binary mode so the parser can detect the encoding itself.
    with open(filename, 'rb') as f:
        parser.parse(f)

if __name__ == "__main__":
    filename = "sample.xml"
    parse_file(filename)
Used against the following sample.xml file:
<tag1>
<tag2>
<title>XML</title>
<text>
Text001
<h1>Header</h1>
Text002
<b>Text003</b>
</text>
</tag2>
</tag1>
would yield
Text001
Header
Text002
Text003
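If the nested tags and attributes need to be kept as well, one possible approach is to rebuild the markup from the SAX events. The following is only a sketch under that assumption: InnerXMLHandler and inner_xml are hypothetical names, and the result is a re-serialization rather than the original bytes (empty elements come back as an open/close pair, and comments and CDATA markers are lost).

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class InnerXMLHandler(xml.sax.ContentHandler):
    # Hypothetical handler: rebuilds the markup found inside `target`.
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # 0 means we are outside the target element
        self.parts = []     # pieces of the current inner markup
        self.results = []   # one string per target element found

    def startElement(self, name, attrs):
        if self.depth > 0:
            # Nested element: write the start tag back out as text.
            attr_text = "".join(
                " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
            )
            self.parts.append("<%s%s>" % (name, attr_text))
            self.depth += 1
        elif name == self.target:
            self.depth = 1
            self.parts = []

    def endElement(self, name):
        if self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # The target element itself just closed.
            self.results.append("".join(self.parts))
        else:
            self.parts.append("</%s>" % name)

    def characters(self, content):
        if self.depth > 0:
            # Re-escape &, < and > so the output is still valid markup.
            self.parts.append(escape(content))


def inner_xml(xml_bytes, target):
    """Return the re-serialized content of every `target` element."""
    handler = InnerXMLHandler(target)
    xml.sax.parseString(xml_bytes, handler)
    return handler.results
```

Called as inner_xml(data, "text") on the sample file's contents, this should return a one-item list holding the text between the opening and closing text tags, with the h1 and b tags preserved.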
Upvotes: 0
Reputation: 3974
That's what CDATA sections are for.
http://www.w3schools.com/xml/xml_cdata.asp
You could use libxml_saxlib to properly handle CDATA sections.
http://www.rexx.com/~dkuhlman/libxml_saxlib.html
UPDATE: As a strictly temporary solution, you can preprocess your input file to make it valid XML. Use 'sed', for example, to insert CDATA tags in the appropriate places.
This does not solve the real problem, but it gives you a parsable XML file if you are lucky (e.g. there are no surprises in the non-XML part of the file...).
Upvotes: -1
Reputation: 6274
I don't believe it's possible with xml.sax. BeautifulSoup has SoupStrainer, which does exactly that. If you're open to using the library, it's quite easy to work with.
Upvotes: 1