This 0ne Pr0grammer
This 0ne Pr0grammer

Reputation: 2662

Extracting Text Nodes From XML File Using SAX Parser in JAVA

So I am currently using SAX to try and extract some information from a a number of xml documents I am working from. Thus far, it is really easy to extract the attribute values. However, I have no clue how to go about extracting actual values from a text node.

For example, in the given XML document:

<w:rStyle w:val="Highlight" /> 
  </w:rPr>
  </w:pPr>
- <w:r>
  <w:t>Text to Extract</w:t> 
  </w:r>
  </w:p>
- <w:p w:rsidR="00B41602" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
- <w:pPr>
  <w:pStyle w:val="Copy" /> 

I can extract "Highlight" no problem by getting the value from val. But I have no idea how to get into that text node and get out "Text to Extract".

Here is my Java code thus far to pull out the attribute values...

private static final class SaxHandler extends DefaultHandler 
    {
        // invoked when document-parsing is started:
        public void startDocument() throws SAXException 
        {
            System.out.println("Document processing starting:");
        }

        // notifies about finish of parsing:
        public void endDocument() throws SAXException 
        {
            System.out.println("Document processing finished. \n");
        }

        // we enter to element 'qName':
        public void startElement(String uri, String localName, 
                String qName, Attributes attrs) throws SAXException 
        {
            if(qName.equalsIgnoreCase("Relationships"))
            {
                // do nothing
            }
            else if(qName.equalsIgnoreCase("Relationship"))
            {
                // goes into the element and if the attribute is equal to "Target"...
                String val = attrs.getValue("Target");
                // ...and the value is not null
                if(val != null)
                {
                    // ...and if the value contains "image" in it...
                    if (val.contains("image"))
                    {
                        // ...then get the id value
                        String id = attrs.getValue("Id");
                        // ...and use the substring method to isolate and print out only the image & number
                        int begIndex = val.lastIndexOf("/");
                        int endIndex = val.lastIndexOf(".");
                        System.out.println("Id: " + id + " & Target: " + val.substring(begIndex+1, endIndex));
                    }
                }
            }
            else 
            {
                throw new IllegalArgumentException("Element '" + 
                        qName + "' is not allowed here");
            }
        }

        // we leave element 'qName' without any actions:
        public void endElement(String uri, String localName, String qName) throws SAXException 
        {
            // do nothing;
        }
     }

But I have no clue where to start to get into that text node and pull out the values inside. Anyone have some ideas?

Upvotes: 2

Views: 7249

Answers (1)

JB Nizet
JB Nizet

Reputation: 691685

Here's some pseudo-code:

private boolean insideElementContainingTextNode;
private StringBuilder textBuilder;

public void startElement(String uri, String localName, String qName, Attributes attrs) {
    if ("w:t".equals(qName)) { // or is it localName?
        insideElementContainingTextNode = true;
        textBuilder = new StringBuilder();
    }
}

public void characters(char[] ch, int start, int length) {
    if (insideElementContainingTextNode) {
        textBuilder.append(ch, start, length);
    }
}

public void endElement(String uri, String localName, String qName) {
    if ("w:t".equals(qName)) { // or is it localName?
        insideElementContainingTextNode = false;
        String theCompleteText = this.textBuilder.toString();
        this.textBuilder = null;
    }
}

Upvotes: 5

Related Questions