Reputation: 3058
I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web interface) where the document is invalid.
Ultimately I want to set the caret in a text area to be at the invalid tag or just inside the open quote of the invalid attribute. (I’m not using XML Schema at this point because the exact format of the attributes matters in a way that cannot be validated by schema alone. I may even want to report some attributes as being invalid part-way through the attribute’s value, or similarly, part-way through the text between a start and end tag.)
I’ve tried using SAX (org.xml.sax) and the Locator interface. This works up to a point but isn’t nearly good enough. It will only report the read position after an event; for example, the character immediately after an open tag ends, for startElement(). I can’t just subtract back the length of the tag name because attributes, self-closing tags and/or newlines within the open tag will throw this out. (And Locator provides no information about the position of attributes at all.)
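To make the limitation concrete, here is a minimal sketch that records what the Locator reports each time startElement() fires. The class name LocatorDemo and the method positions() are made up for this illustration, and the exact column values depend on the parser implementation (the JDK's bundled parser is assumed here).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records where the Locator says the parser is
// each time startElement() is called.
class LocatorDemo {

    // Returns one entry per start tag, formatted as "qName@line:column".
    static List<String> positions(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                // The Locator describes the read position after the event,
                // not the position of the '<' that opened the tag.
                events.add(qName + "@" + locator.getLineNumber()
                        + ":" + locator.getColumnNumber());
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // The <child> tag opens on line 2 but its attribute and '/>' are on line 3.
        String xml = "<root>\n  <child\n     a=\"1\"/>\n</root>";
        positions(xml).forEach(System.out::println);
    }
}
```

With this input the child element is reported on line 3, where its start tag ends, even though the tag opens on line 2, and nothing at all is reported for the attribute a.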
Ideally I was looking to use an event-based approach, as I already have a SAX handler that is building an in-house DOM-like representation for further processing. However, I would be interested in knowing about any DOM or DOM-like library that includes exact position information for the model’s elements.
Has anyone solved this issue, or any like it, with the required level of precision?
Upvotes: 9
Views: 2496
Reputation: 8287
XML parsers will (and should) smooth over certain things like additional whitespace, so exact mapping back to the character stream is not feasible.
You should rather look into getting a lexer or 'token stream generator' for increased detail, in other words go to the detail level below XML parsers.
There are a few general frameworks for writing lexers in Java. This ANTLR 3-based page has a nice overview of lexer vs parser, and section one has some rudimentary XML lexer examples.
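As a rough illustration of what working below the parser level can look like, here is a hand-written sketch rather than an ANTLR grammar. The names PositionScanner and Mark are invented for this example. It only understands start tags, quoted attribute values, comments and CDATA sections, so it is not a conforming XML lexer; it just shows that a token-level pass can keep the exact line, column and character offset of each tag and attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: records 1-based line/column (and 0-based offset) of each
// start tag's '<' and of the first character inside each attribute's open quote.
class PositionScanner {

    static final class Mark {
        final String kind;   // "tag" or "attr"
        final String name;
        final int line, col, offset;

        Mark(String kind, String name, int line, int col, int offset) {
            this.kind = kind; this.name = name;
            this.line = line; this.col = col; this.offset = offset;
        }

        @Override
        public String toString() {
            return kind + " " + name + " at " + line + ":" + col + " (offset " + offset + ")";
        }
    }

    static boolean isNameStart(char ch) {
        return Character.isLetter(ch) || ch == '_' || ch == ':';
    }

    static boolean isNameChar(char ch) {
        return isNameStart(ch) || Character.isDigit(ch) || ch == '-' || ch == '.';
    }

    static List<Mark> scan(String xml) {
        int n = xml.length();
        // Precompute the line and column of every character offset.
        int[] line = new int[n + 1];
        int[] col = new int[n + 1];
        int l = 1, c = 1;
        for (int k = 0; k < n; k++) {
            line[k] = l; col[k] = c;
            if (xml.charAt(k) == '\n') { l++; c = 1; } else { c++; }
        }
        line[n] = l; col[n] = c;

        List<Mark> marks = new ArrayList<>();
        int i = 0;
        while (i < n) {
            // Skip comments and CDATA so markup-like text inside them is ignored.
            if (xml.startsWith("<!--", i)) {
                int end = xml.indexOf("-->", i + 4);
                i = end < 0 ? n : end + 3;
                continue;
            }
            if (xml.startsWith("<![CDATA[", i)) {
                int end = xml.indexOf("]]>", i + 9);
                i = end < 0 ? n : end + 3;
                continue;
            }
            if (xml.charAt(i) != '<') { i++; continue; }
            int tagStart = i++;
            // End tags, processing instructions and declarations are not recorded.
            if (i >= n || !isNameStart(xml.charAt(i))) continue;
            int nameStart = i;
            while (i < n && isNameChar(xml.charAt(i))) i++;
            marks.add(new Mark("tag", xml.substring(nameStart, i),
                    line[tagStart], col[tagStart], tagStart));
            // Walk the attributes until the start tag closes.
            while (i < n && xml.charAt(i) != '>') {
                if (!isNameStart(xml.charAt(i))) { i++; continue; }
                int attrStart = i;
                while (i < n && isNameChar(xml.charAt(i))) i++;
                String attrName = xml.substring(attrStart, i);
                while (i < n && (Character.isWhitespace(xml.charAt(i)) || xml.charAt(i) == '=')) i++;
                if (i < n && (xml.charAt(i) == '"' || xml.charAt(i) == '\'')) {
                    char quote = xml.charAt(i++);
                    marks.add(new Mark("attr", attrName, line[i], col[i], i));
                    while (i < n && xml.charAt(i) != quote) i++;
                    i++; // step over the closing quote
                }
            }
        }
        return marks;
    }
}
```

The character offset is kept alongside line and column because a caret position in a browser text control is usually expressed as an offset. An error part-way through an attribute value could then be reported as that offset plus the index within the value, as long as the value contains no entity references that change its length.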
I'd also like to comment that, since your users work through a web interface, you might consider a pure client-side (i.e. JavaScript) solution.
Upvotes: 2
Reputation: 1614
I wrote a quick class that parses an XML file, logs the line number of each element, and throws an exception when it finds an unwanted attribute. The exception message includes the text that preceded the error.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Stack;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.log4j.Logger;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class LocatorTestSAXReader {

    private static final Logger logger = Logger.getLogger(LocatorTestSAXReader.class);
    private static final String XML_FILE_PATH = "lib/xml/test-instance1.xml";

    public Document readXMLFile() {
        Document doc = null;
        SAXParser parser = null;
        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
        try {
            parser = saxFactory.newSAXParser();
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            // Note: this document is created empty and is never populated below.
            doc = docBuilder.newDocument();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }

        // Accumulates the element text seen so far, for use in the error message.
        final StringBuilder text = new StringBuilder();

        DefaultHandler eleHandler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void characters(char[] ch, int start, int length) {
                String thisText = new String(ch, start, length);
                // Only keep chunks that contain at least one letter.
                if (thisText.matches(".*[a-zA-Z]+.*")) {
                    text.append(thisText);
                    logger.debug("element text: " + thisText);
                }
            }

            @Override
            public void setDocumentLocator(Locator locator) {
                // The parser supplies the locator before any other event.
                this.locator = locator;
            }

            @Override
            public void startElement(final String uri, final String localName, final String qName,
                    final Attributes attributes)
                    throws SAXException {
                int lineNum = locator.getLineNumber();
                logger.debug("I am now on line " + lineNum + " at element " + qName);
                int len = attributes.getLength();
                for (int i = 0; i < len; i++) {
                    String attVal = attributes.getValue(i);
                    String attName = attributes.getQName(i);
                    logger.debug("att " + attName + "=" + attVal);
                    if (attName.startsWith("bad")) {
                        throw new SAXException("found attr : " + attName + "=" + attVal + " that starts with bad! at line : " +
                                locator.getLineNumber() + " at element " + qName + "\nelement occurs below text : " + text);
                    }
                }
            }
        };

        try {
            parser.parse(new FileInputStream(new File(XML_FILE_PATH)), eleHandler);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return doc;
    }
}
With regard to the text: depending on where in the XML file the error occurs, there may not be any text. So with this XML:
<?xml version="1.0"?>
<root>
  <section>
    <para>This is a quick doc to test the ability to get line numbers via the Locator object. </para>
  </section>
  <section bad:attr="ok">
    <para>another para.</para>
  </section>
</root>
If the bad attribute is in the first element, the text will be blank. In this case, the exception thrown was:
org.xml.sax.SAXException: found attr : bad:attr=ok that starts with bad! at line : 6 at element section
element occurs below text : This is a quick doc to test the ability to get line numbers via the Locator object.
When you say you tried using the Locator object, what exactly was the problem?
Upvotes: 0