Fedor

Reputation: 1444

Can I somehow tell a SAX parser to stop at some element and get its child nodes as a string?

I have pretty big XML documents, so I don't want to use DOM. While parsing a document with a SAX parser, I want to stop at some point (say, when I reach an element with a certain name) and get everything inside that element as a string. "Everything" inside is not necessarily a text node; it may contain tags, but I don't want them to be parsed, I just want to get them as text.

I'm writing in Python. Is this possible? Thanks!

Upvotes: 3

Views: 2342

Answers (4)

Max

Reputation: 1552

It does not seem to be offered by the xml.sax API, but you can utilize another way of interrupting control flow: exceptions.

Just define a custom exception for that purpose:

class FinishedParsing(Exception):
    pass

Raise this exception in your handler once you have everything you need, then catch and ignore it around the parse call:

try:
    parser.parse(xml)
except FinishedParsing:
    pass
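
For completeness, here is a minimal, self-contained sketch of that pattern. The handler name FirstTitleHandler, the helper first_title, and the sample document are made up for illustration; only FinishedParsing comes from the answer above. It collects the text of the first <title> element and aborts before the rest of the document is reported to the handler.

```python
import xml.sax


class FinishedParsing(Exception):
    """Raised by the handler to abort parsing early."""


class FirstTitleHandler(xml.sax.ContentHandler):
    # Hypothetical handler: keeps the text of the first <title>, then stops.
    def __init__(self):
        super().__init__()
        self.inside = False
        self.parts = []
        self.seen = []  # names of every element the parser reported

    def startElement(self, name, attrs):
        self.seen.append(name)
        if name == "title":
            self.inside = True

    def characters(self, content):
        if self.inside:
            self.parts.append(content)

    def endElement(self, name):
        if name == "title":
            # We have what we need; unwind out of the parser.
            raise FinishedParsing


def first_title(xml_bytes):
    handler = FirstTitleHandler()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except FinishedParsing:
        pass  # expected: the handler stopped the parse on purpose
    return "".join(handler.parts), handler.seen


# Illustrative input, not taken from the question.
SAMPLE = b"<tag1><tag2><title>XML</title><text>Text001</text></tag2></tag1>"
```

Calling first_title(SAMPLE) returns the title text together with the list of elements seen before the abort; the <text> element after <title> is never reported.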

Upvotes: 2

user635090

Reputation: 1401

Here is a hackish way to do this using SAX. It keeps only the character data inside the target element (here, <text>); the child tags themselves are dropped. It gets more complicated if you also need to keep the tags and attributes inside that element.

from xml.sax import handler, make_parser

class CustomContentHandler(handler.ContentHandler):
    """Collects the character data found inside <text> elements."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.inside_text_tag = False
        self.text_content = []

    def startElement(self, name, attrs):
        if name == 'text':
            self.inside_text_tag = True
            self.text_content = []  # start a fresh buffer for each <text>

    def endElement(self, name):
        if name == 'text':
            self.inside_text_tag = False
            # characters() may be called several times, so join the chunks
            self.text = ''.join(self.text_content)
            print(self.text)

    def characters(self, content):
        # Child tags such as <h1> are skipped; only their text is kept
        if self.inside_text_tag:
            self.text_content.append(content)

def parse_file(filename):
    parser = make_parser()
    parser.setContentHandler(CustomContentHandler())
    # Binary mode lets the parser detect the encoding itself
    with open(filename, 'rb') as f:
        parser.parse(f)

if __name__ == "__main__":
    parse_file("sample.xml")

Used against the following sample.xml file:

<tag1>
  <tag2>
    <title>XML</title>
    <text>
      Text001
      <h1>Header</h1>
      Text002
      <b>Text003</b>
    </text>
  </tag2>
</tag1>

would yield (indentation and blank lines omitted)

Text001
Header
Text002
Text003
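
If the inner tags and attributes need to be kept as well, one option is to rebuild the markup from the SAX events. The sketch below is only an approximation: the names InnerMarkupHandler and inner_markup are made up here, and the result is a re-serialization, so attribute quoting, entity forms, comments, and empty-element syntax may differ from the original bytes.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class InnerMarkupHandler(xml.sax.ContentHandler):
    # Hypothetical handler: rebuilds the markup inside each `target` element.
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # 0 = outside the target, >= 1 = inside it
        self.parts = []     # pieces of the element currently being rebuilt
        self.results = []   # one string per completed target element

    def startElement(self, name, attrs):
        if self.depth:
            # Nested element: write its start tag back out as text.
            rendered = "".join(
                " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
            )
            self.parts.append("<%s%s>" % (name, rendered))
            self.depth += 1
        elif name == self.target:
            self.depth = 1
            self.parts = []

    def endElement(self, name):
        if not self.depth:
            return
        self.depth -= 1
        if self.depth:
            self.parts.append("</%s>" % name)
        else:
            # Closed the target itself: store the rebuilt inner markup.
            self.results.append("".join(self.parts))

    def characters(self, content):
        if self.depth:
            # Re-escape &, < and > so the result stays well-formed.
            self.parts.append(escape(content))


def inner_markup(xml_bytes, target):
    handler = InnerMarkupHandler(target)
    xml.sax.parseString(xml_bytes, handler)
    return handler.results
```

For example, inner_markup(data, "text") returns a list with one string per <text> element, each holding the reconstructed child markup.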

Upvotes: 0

egbokul

Reputation: 3974

That's what CDATA sections are for.

http://www.w3schools.com/xml/xml_cdata.asp

You could use libxml_saxlib to properly handle CDATA sections.

http://www.rexx.com/~dkuhlman/libxml_saxlib.html

UPDATE: as a strictly temporary solution, you can preprocess your input file to make it valid XML. For example, use 'sed' to insert CDATA tags in the appropriate places.

This does not solve the real problem, but with luck it gives you a parsable XML file (e.g. as long as there are no surprises in the non-XML part of the file).
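
To illustrate the CDATA idea with the standard xml.sax module (rather than libxml_saxlib): once the inner markup is wrapped in a CDATA section, the parser hands it to characters() as plain text and reports no start or end events for the tags inside. The handler name TextCollector and the sample document are assumptions made for this sketch.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    # Hypothetical handler: gathers character data inside <text>.
    def __init__(self):
        super().__init__()
        self.inside = False
        self.parts = []
        self.elements = []  # every element name the parser reported

    def startElement(self, name, attrs):
        self.elements.append(name)
        if name == "text":
            self.inside = True

    def endElement(self, name):
        if name == "text":
            self.inside = False

    def characters(self, content):
        if self.inside:
            self.parts.append(content)


# Illustrative document: the inner markup is protected by a CDATA section.
DOC = b"<tag2><text><![CDATA[Text001 <h1>Header</h1> Text002]]></text></tag2>"

collector = TextCollector()
xml.sax.parseString(DOC, collector)
raw = "".join(collector.parts)  # the CDATA content, tags included, as text
```

Here raw holds the content of the CDATA section verbatim, and <h1> never shows up in collector.elements.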

Upvotes: -1

jeffknupp

Reputation: 6274

I don't believe it's possible with xml.sax. BeautifulSoup has SoupStrainer, which lets you parse only the parts of the document you care about. If you're open to using the library, it's quite easy to work with.

Upvotes: 1
