Reputation: 6879
SAX provides a Locator that keeps track of the current location. However, when I call it in my startElement(), it always returns the ending location of the XML tag.
How can I get the starting location of the tag? Is there any way to gracefully solve this problem?
Upvotes: 6
Views: 3727
Reputation: 6879
Here is the solution I finally figured out. (I was too lazy to post it earlier, sorry.) The characters(), endElement() and ignorableWhitespace() methods are the crucial ones: in each of them the locator points to a possible starting point of the next tag. In characters(), the locator points to the end of the closest non-tag content. In endElement(), it points to the end of the previous tag, which is the start of the current tag when the two are directly adjacent. In ignorableWhitespace(), it points to the end of a run of spaces and tabs. By keeping track of the most recent ending position reported in these three methods, we can find the starting point of a tag, and the locator in endElement() already gives us its ending position. That way both the starting point and the ending point of an XML element can be found.
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class Example extends DefaultHandler {
    private Locator locator;
    // end of the most recent event, i.e. where the next tag can start
    private SourcePosition startElePoint = new SourcePosition();
    // start positions of the elements that are still open; without this,
    // child events would overwrite the start of an enclosing element
    private final Deque<SourcePosition> openElementStarts = new ArrayDeque<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    /**
     * <a>    <- the locator points to here
     *   <b>
     * </a>
     */
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        // startElePoint still holds the end of the previous event,
        // which is where this start tag begins; keep it for endElement()
        this.openElementStarts.push(this.startElePoint);
        // the end of this start tag is where a directly following tag starts
        this.updateElePoint(this.locator);
    }

    /**
     * <a>
     *   <b>
     * </a>   <- the locator points to here
     */
    @Override
    public void endElement(String uri, String localName, String qName) {
        /* here we can get our source position */
        SourcePosition tag_source_starting_position = this.openElementStarts.pop();
        SourcePosition tag_source_ending_position =
                new SourcePosition(this.locator.getLineNumber(),
                                   this.locator.getColumnNumber());
        // do your things here

        // update the starting point for the next tag
        this.updateElePoint(this.locator);
    }

    /**
     * some other words   <- the locator points to here
     * <a>
     *   <b>
     * </a>
     */
    @Override
    public void characters(char[] ch, int start, int length) {
        this.updateElePoint(this.locator); // update the starting point
    }

    /**
     * the locator points to here -> <a>
     *                                 <b>
     *                               </a>
     */
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        this.updateElePoint(this.locator); // update the starting point
    }

    // only ever move the candidate starting point forward
    private void updateElePoint(Locator lo) {
        SourcePosition item = new SourcePosition(lo.getLineNumber(), lo.getColumnNumber());
        if (this.startElePoint.compareTo(item) < 0) {
            this.startElePoint = item;
        }
    }

    static class SourcePosition implements Comparable<SourcePosition> {
        private int line;
        private int column;

        public SourcePosition() {
            this.line = 1;
            this.column = 1;
        }

        public SourcePosition(int line, int col) {
            this.line = line;
            this.column = col;
        }

        public int getLine() {
            return this.line;
        }

        public int getColumn() {
            return this.column;
        }

        public void setLine(int line) {
            this.line = line;
        }

        public void setColumn(int col) {
            this.column = col;
        }

        @Override
        public int compareTo(SourcePosition o) {
            // order by line first, then by column
            if (this.line != o.line) {
                return Integer.compare(this.line, o.line);
            }
            return Integer.compare(this.column, o.column);
        }
    }
}
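To try the idea out end to end, here is a minimal, self-contained sketch that parses a small XML string and prints one line per element with its approximate start and end position. The names TagSpanDemo, SpanHandler and spansOf are made up for this illustration, and it assumes the parser actually supplies a Locator (the parser bundled with the JDK does). Columns are whatever the parser reports, so treat them as approximate.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class TagSpanDemo {

    /** Records "name startLine:startCol-endLine:endCol" for every element. */
    static class SpanHandler extends DefaultHandler {
        final List<String> spans = new ArrayList<>();
        private final Deque<int[]> openStarts = new ArrayDeque<>();
        private Locator locator;
        private int lastLine = 1; // end of the previous event
        private int lastCol = 1;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        // Remember where the current event ended.
        private void remember() {
            lastLine = locator.getLineNumber();
            lastCol = locator.getColumnNumber();
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            // The previous event ended where this start tag begins.
            openStarts.push(new int[] {lastLine, lastCol});
            remember();
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            int[] start = openStarts.pop();
            spans.add(qName + " " + start[0] + ":" + start[1] + "-"
                    + locator.getLineNumber() + ":" + locator.getColumnNumber());
            remember();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            remember();
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            remember();
        }
    }

    /** Parses the given XML text and returns the recorded spans. */
    static List<String> spansOf(String xml) {
        SpanHandler handler = new SpanHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.spans;
    }

    public static void main(String[] args) {
        for (String span : spansOf("<a>\n  <b>text</b>\n</a>")) {
            System.out.println(span);
        }
    }
}
```

Elements are reported in the order they close, so the inner element b is printed before the outer element a.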
Upvotes: 3
Reputation: 2115
Unfortunately, the Locator interface provided by the Java system library in the org.xml.sax package does not, by definition, provide more detailed information about the document location. To quote from the documentation of the getColumnNumber method (highlights added by me):
The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.
According to that specification, you will always get the position "of the first character after the text associated with the document event" based on best effort by the SAX driver. So the short answer to the first part of your question is: No, the Locator does not provide information about the start location of a tag. Also, if you are dealing with multi-byte characters in your documents, e.g., Chinese or Japanese text, the position you get from the SAX driver is probably not what you want.
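The "first character after the event" behaviour is easy to observe with a start tag that spans two lines. The following is only a small sketch: LocatorEndDemo and startElementPosition are hypothetical names, and it assumes the default JDK parser, which supplies a Locator.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class LocatorEndDemo {

    /** Returns {line, column} as reported by the Locator in the first startElement() call. */
    static int[] startElementPosition(String xml) {
        final int[] position = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;
            private boolean seen;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (!seen) {
                    position[0] = locator.getLineNumber();
                    position[1] = locator.getColumnNumber();
                    seen = true;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return position;
    }

    public static void main(String[] args) {
        // The start tag opens on line 1 and closes on line 2.
        int[] p = startElementPosition("<root\n    id=\"1\"/>");
        System.out.println("startElement reported line " + p[0] + ", column " + p[1]);
    }
}
```

Although the tag opens on line 1, the line reported inside startElement() should be the one on which the tag closes, which matches the quoted documentation.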
If you are after exact positions for tags, or want even more fine-grained information about attributes, attribute content etc., you would have to implement your own location provider.
With all the potential encoding issues, Unicode characters etc. involved, I guess this is too big a project to post here; the implementation will also depend on your specific requirements.
Just a quick warning from personal experience: writing a wrapper around the InputStream you pass into the SAX parser is dangerous, as you don't know when the SAX parser will report its events relative to what it has already read from the stream.
You could start by doing some counting of your own in the characters(char[], int, int) method of your ContentHandler, checking for line breaks, tabs etc. in addition to using the Locator information, which should give you a better picture of where in the document you actually are. By remembering the position of the last event, you could calculate the start position of the current one. Keep in mind, though, that you might not see all line breaks, as some can appear inside tags, which are not reported through characters, but you could deduce those from the Locator information.
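As a rough sketch of the counting idea, the helper below advances its own line and column over the text handed to characters() and can be re-synchronised from the Locator after markup events. PositionCounter and its methods are hypothetical names, and the sketch makes simplifying assumptions: every char (including a tab) counts as one column, and only '\n' is treated as a line break, which relies on XML parsers normalising line endings to '\n' before reporting text.

```java
// Hypothetical helper, for illustration only.
class PositionCounter {
    private int line = 1;
    private int column = 1;

    /** Advances the position over the slice passed to characters(). */
    void advance(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (ch[i] == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
    }

    /**
     * Overwrites the position, e.g. with Locator.getLineNumber() and
     * Locator.getColumnNumber() after a tag, since markup never passes
     * through characters().
     */
    void resync(int line, int column) {
        this.line = line;
        this.column = column;
    }

    int line() {
        return line;
    }

    int column() {
        return column;
    }

    public static void main(String[] args) {
        PositionCounter counter = new PositionCounter();
        char[] text = "ab\ncd".toCharArray();
        counter.advance(text, 0, text.length);
        System.out.println(counter.line() + ":" + counter.column());
    }
}
```

In a ContentHandler you would call advance(ch, start, length) from characters() and resync(...) from startElement() and endElement(). The position held by the counter just before an event is then an estimate of where that event started.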
Upvotes: 2
Reputation: 169284
What SAX parser are you using? Some, I am told, do not provide a Locator facility.
The output of the simple Python program below will give you the starting row and column number of every element in your XML file, e.g. if you indent two spaces in your XML:
Element: MyRootElem
starts at row 2 and column 0
Element: my_first_elem
starts at row 3 and column 2
Element: my_second_elem
starts at row 4 and column 4
Run like this: python sax_parser_filename.py my_xml_file.xml
#!/usr/bin/env python3
import sys
from xml.sax import ContentHandler, make_parser
from xml.sax.xmlreader import Locator


class MySaxDocumentHandler(ContentHandler):
    """
    the document handler class will serve
    to instantiate an event handler which will
    act on various events coming from the parser
    """
    def __init__(self):
        ContentHandler.__init__(self)
        # placeholder; the parser installs its own locator before parsing starts
        self.setDocumentLocator(Locator())

    def startElement(self, name, attrs):
        print("Element: %s" % name)
        print("starts at row %s" % self._locator.getLineNumber(),
              "and column %s\n" % self._locator.getColumnNumber())

    def endElement(self, name):
        pass


def mysaxparser(inFileName):
    # create a handler
    handler = MySaxDocumentHandler()
    # create a parser
    parser = make_parser()
    # associate our content handler to the parser
    parser.setContentHandler(handler)
    # start parser; the file is closed when the with block exits
    with open(inFileName, 'rb') as inFile:
        parser.parse(inFile)


def main():
    mysaxparser(sys.argv[1])


if __name__ == '__main__':
    main()
Upvotes: 1