Paul

Reputation: 3058

XML / Java: Precise line and character positions whilst parsing tags and attributes?

I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web interface) where the document is invalid.

Ultimately I want to place the caret at the invalid tag, or just inside the opening quote of the invalid attribute. (I’m not using XML Schema at this point because the exact format of the attributes matters in a way that cannot be validated by schema alone. I may even want to report an attribute as invalid part-way through its value or, similarly, part-way through the text between a start and end tag.)

I’ve tried using SAX (org.xml.sax) and the Locator interface. This works up to a point but isn’t nearly good enough. It will only report the read position after an event; for example, the character immediately after an open tag ends, for startElement(). I can’t just subtract back the length of the tag name because attributes, self-closing tags and/or newlines within the open tag will throw this out. (And Locator provides no information about the position of attributes at all.)
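To make the limitation concrete, here is a minimal sketch of the kind of handler I mean. The class name LocatorDemo and the sample document are made up for illustration; the parser is whatever SAXParserFactory returns by default. It records the position the Locator reports for each startElement() call, and for a start tag that spans two lines the reported position belongs to the end of the tag, not to its opening '<'.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorDemo {

    /** Returns one "qName@line:column" entry per start tag, as reported by the Locator. */
    static List<String> startTagPositions(String xml) throws Exception {
        final List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The Locator describes where the parser is now, i.e. after the whole start tag.
                out.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // The <item> start tag begins on line 2 but ends on line 3.
        String xml = "<root>\n  <item\n     id=\"1\"/>\n</root>";
        System.out.println(startTagPositions(xml));
    }
}
```

With the parser bundled in the JDK, the entry for item carries the line on which the tag closes, and nothing in the Attributes object says where id="1" sits.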

Ideally I would use an event-based approach, as I already have a SAX handler that builds an in-house DOM-like representation for further processing. However, I would be interested to hear about any DOM or DOM-like library that includes exact position information for the model’s elements.

Has anyone solved this issue, or anything like it, with the required level of precision?

Upvotes: 9

Views: 2496

Answers (2)

ThomasRS

Reputation: 8287

XML parsers will (and should) smooth over certain things like additional whitespace, so exact mapping back to the character stream is not feasible.

You should rather look into getting a lexer or 'token stream generator' for increased detail, in other words go to the detail level below XML parsers.

There are a few general frameworks for writing lexers in Java. This ANTLR 3-based page has a nice overview of lexer vs parser and, in section one, some rudimentary XML lexer examples.
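To illustrate what working at the token level buys you, here is a rough hand-written sketch (not ANTLR, and not a conforming XML lexer). The names PositionLexer and Token are made up for this example. It assumes well-formed start tags, skips comments and CDATA sections, ignores DOCTYPE declarations and entities, and counts a tab as one column. It records the 1-based line and column of each start tag's '<' and of the first character inside each attribute's opening quote.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionLexer {

    /** A token plus the 1-based line/column where it starts in the source text. */
    static final class Token {
        final String kind, text;
        final int line, column;
        Token(String kind, String text, int line, int column) {
            this.kind = kind; this.text = text; this.line = line; this.column = column;
        }
        @Override public String toString() { return kind + " " + text + " @" + line + ":" + column; }
    }

    private final String src;
    private int pos = 0, line = 1, col = 1;
    private final List<Token> tokens = new ArrayList<>();

    private PositionLexer(String src) { this.src = src; }

    private boolean more() { return pos < src.length(); }
    private char peek() { return src.charAt(pos); }

    /** Consumes one character, keeping line and column in step with pos. */
    private char advance() {
        char c = src.charAt(pos++);
        if (c == '\n') { line++; col = 1; } else { col++; }
        return c;
    }

    private void skipWhitespace() {
        while (more() && Character.isWhitespace(peek())) advance();
    }

    /** Skips up to and including the given terminator (used for comments and CDATA). */
    private void skipPast(String end) {
        while (more() && !src.startsWith(end, pos)) advance();
        for (int i = 0; i < end.length() && more(); i++) advance();
    }

    private static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == ':' || c == '-' || c == '.';
    }

    private String readName() {
        int start = pos;
        while (more() && isNameChar(peek())) advance();
        return src.substring(start, pos);
    }

    static List<Token> lex(String src) {
        PositionLexer lx = new PositionLexer(src);
        while (lx.more()) {
            if (src.startsWith("<!--", lx.pos)) lx.skipPast("-->");
            else if (src.startsWith("<![CDATA[", lx.pos)) lx.skipPast("]]>");
            else if (lx.peek() == '<' && lx.pos + 1 < src.length()
                    && Character.isLetter(src.charAt(lx.pos + 1))) lx.startTag();
            else lx.advance();
        }
        return lx.tokens;
    }

    private void startTag() {
        int tagLine = line, tagCol = col;           // position of '<'
        advance();
        tokens.add(new Token("TAG", readName(), tagLine, tagCol));
        while (more()) {
            skipWhitespace();
            if (!more()) return;
            char c = peek();
            if (c == '>') { advance(); return; }    // end of start tag
            if (!isNameChar(c)) { advance(); continue; } // '/' of "/>" or stray input
            String name = readName();
            skipWhitespace();
            if (more() && peek() == '=') {
                advance();
                skipWhitespace();
                if (more() && (peek() == '"' || peek() == '\'')) {
                    char quote = advance();
                    // line/col now point just inside the opening quote
                    tokens.add(new Token("ATTR", name, line, col));
                    while (more() && peek() != quote) advance();
                    if (more()) advance();          // closing quote
                }
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <item\n     id=\"1\" name='x'/>\n</root>";
        for (Token t : lex(xml)) System.out.println(t);
    }
}
```

For the sample input in main, the id attribute is reported at line 3, column 10, which is the character immediately after its opening quote. A real implementation would still have to deal with everything this sketch ignores, so treat it as a starting point for a proper lexer rather than a replacement for one.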

I'd also add that, since the user works through a web interface, you might consider a purely client-side (i.e. JavaScript) solution.

Upvotes: 2

badperson

Reputation: 1614

I wrote a quick class that parses an XML file, logs the line number of each element, and throws an exception when it finds an unwanted attribute, reporting the line and the text that preceded the error.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Stack;


import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.log4j.Logger;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;



public class LocatorTestSAXReader {
    private static final Logger logger = Logger.getLogger(LocatorTestSAXReader.class);

    private static final String XML_FILE_PATH = "lib/xml/test-instance1.xml";

public Document readXMLFile(){

    Document doc = null;
    SAXParser parser = null;

    SAXParserFactory saxFactory = SAXParserFactory.newInstance();
    try {
        parser = saxFactory.newSAXParser();
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        doc = docBuilder.newDocument();

    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }


    StringBuilder text = new StringBuilder();
    DefaultHandler eleHandler = new DefaultHandler(){
        private Locator locator;

        @Override 
        public void characters(char[] ch, int start, int length){
            String thisText = new String(ch, start, length);
            // Keep only chunks that contain a letter; (?s) lets '.' match newlines as well
            if(thisText.matches("(?s).*[a-zA-Z]+.*")){
                text.append(thisText);
                logger.debug("element text: " + thisText);
            }

        }



        @Override
        public void setDocumentLocator(Locator locator){
            this.locator = locator;
        }

        @Override
        public void startElement(final String uri, final String localName, final String qName, 
                final Attributes attributes)
                    throws SAXException {
            int lineNum = locator.getLineNumber();
            logger.debug("I am now on line " + lineNum + " at element " + qName);

            int len = attributes.getLength();
            for(int i=0;i<len;i++){
                String attVal = attributes.getValue(i);
                String attName = attributes.getQName(i);

                logger.debug("att " + attName + "=" + attVal);

                if(attName.startsWith("bad")){
                    throw new SAXException("found attr : " + attName + "=" + attVal + " that starts with bad! at line : " + 
                locator.getLineNumber() + " at element " + qName +   "\nelement occurs below text : " + text);
                }
            }

        }




    };

    try {
        parser.parse(new FileInputStream(new File(XML_FILE_PATH)), eleHandler);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return doc;
}


}

With regard to the text: depending on where in the XML file the error occurs, there may not be any text. So with this XML:

<?xml version="1.0"?>
<root>
  <section>
    <para>This is a quick doc to test the ability to get line numbers via the Locator object. </para>
  </section>    
  <section bad:attr="ok">
    <para>another para.</para>
  </section>
</root>

If the bad attribute is in the first element, the text will be blank. In this case, the exception thrown was:

org.xml.sax.SAXException: found attr : bad:attr=ok that starts with bad! at line : 6 at element section
element occurs below text : This is a quick doc to test the ability to get line numbers via the Locator object. 
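A possible refinement, sketched here rather than taken from the class above: throwing a SAXParseException built from the Locator makes the line and column available as fields on the exception instead of text inside the message. The class name BadAttrCheck and the method findBadAttribute are made up for this example, and the reported position is still whatever the Locator says at startElement() time.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class BadAttrCheck {

    /** Returns the exception raised for the first attribute starting with "bad", or null if none. */
    static SAXParseException findBadAttribute(String xml) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                for (int i = 0; i < atts.getLength(); i++) {
                    String name = atts.getQName(i);
                    if (name.startsWith("bad")) {
                        // The constructor copies line and column from the Locator right now.
                        throw new SAXParseException(
                                "unwanted attribute " + name + " on <" + qName + ">", locator);
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return null;
        } catch (SAXParseException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<root>\n  <section>\n    <para>x</para>\n"
                + "  </section>\n  <section bad:attr=\"ok\">\n    <para>y</para>\n  </section>\n</root>";
        SAXParseException e = findBadAttribute(xml);
        System.out.println(e.getMessage() + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber());
    }
}
```

The caller can then read getLineNumber() and getColumnNumber() directly, which is handier for a web interface than parsing them back out of a message string.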

When you say you tried using the Locator object, what exactly was the problem?

Upvotes: 0
